Packages and Installation of TensorFlow

Install and load required packages.

if(!require('tensorflow')){install.packages('tensorflow')} # Package connects to TensorFlow
library(tensorflow)
#install_tensorflow() # Install TensorFlow if not already installed; requires a working Anaconda installation (you will be prompted in the console to download Anaconda first if it is missing). The syntax used in this document requires TensorFlow version 2 or above.
if(!require('keras')){install.packages('keras')} # Use keras for neural network creation; interfaces to the Python Keras API which can run over TensorFlow
library(keras)
if(!require('caret')){install.packages('caret')} # Use to create folds for cross validation and obtain confusion matrix
library(caret)
if(!require('e1071')){install.packages('e1071')} # Also required by caret's confusionMatrix and the related measures (precision, recall, F1)
library(e1071)
if(!require('ROSE')){install.packages('ROSE')} # For oversampling
library(ROSE)
if(!require('pROC')){install.packages('pROC')} # For ROC/AUC
library(pROC)
if(!require('class')){install.packages('class')} # For knn function
library(class)
if(!require('randomForest')){install.packages('randomForest')} # For randomForest function
library(randomForest)
if(!require('ggplot2')){install.packages('ggplot2')} # For visualizations
library(ggplot2)

Load Data

First we will load the dataset that is ready for modeling.

db.model <- read.csv('/Users/amyhowe/Desktop/Dataset_Modeling.csv')

Create Folds

Next we will create 10 stratified folds: each of the 10 folds will have a similar proportion of the class attribute (readmitted), so that the folds are representative of the dataset overall, around 9% readmitted and 91% not readmitted.

set.seed(42) # Make results reproducible
folds <- createFolds(y = db.model$readmitted, k = 10, list = FALSE) # Stratified sampling to use for cross-validation
db.folds <- db.model
db.folds$folds <- folds # Create dataframe with the fold numbers as a column so they may be used together going forward
for (f in 1:10){
  print(prop.table(table(db.folds$readmitted[folds == f])))
} # All folds are close to 91% for not readmitted (0), and 9% for readmitted (1), which is close to the proportions in the overall dataset
## 
##          0          1 
## 0.91040297 0.08959703 
## 
##          0          1 
## 0.91496356 0.08503644 
## 
##          0          1 
## 0.91253394 0.08746606 
## 
##          0          1 
## 0.91210519 0.08789481 
## 
##          0          1 
## 0.91240354 0.08759646 
## 
##          0          1 
## 0.90867515 0.09132485 
## 
##          0          1 
## 0.90881806 0.09118194 
## 
##          0          1 
## 0.90838931 0.09161069 
## 
##          0          1 
## 0.90553094 0.09446906 
## 
##          0          1 
## 0.90911689 0.09088311
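As a quick sanity check on the printout above, we can collect the per-fold minority proportions (values copied from the output; the averaging shortcut assumes the folds are near-equal in size) and confirm they cluster around the 9% overall readmission rate:

```r
# Per-fold proportions of readmitted = 1, copied from the output above
p1 <- c(0.08959703, 0.08503644, 0.08746606, 0.08789481, 0.08759646,
        0.09132485, 0.09118194, 0.09161069, 0.09446906, 0.09088311)
range(p1)  # folds span roughly 0.085 to 0.094
mean(p1)   # about 0.0897, i.e. close to the ~9% readmitted rate overall
```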

Oversample to Balance Target Attribute

Next we will randomly oversample the minority class of the target attribute. We want the classes to be approximately balanced in each fold so that we may avoid model bias that favours the majority class. Oversampling was chosen over undersampling so that no information is lost.

db.over <- db.folds
for (f in 1:10) {
  x <- ovun.sample(readmitted ~ ., data = db.over[db.over$folds == f, ], method = 'over', seed = 38)
  y <- db.over[db.over$folds != f, ]
  db.over <- rbind(y, x$data)
} # For each fold, randomly oversample the minority class so that the target attribute (readmitted) is balanced prior to modeling, and save the rows back into the dataframe db.over

table(db.over$folds, db.over$readmitted) # Each fold now has approximately equal numbers of cases in each category of the readmitted attribute so that we may avoid model bias towards the majority class
##     
##         0    1
##   1  6371 6392
##   2  6402 6421
##   3  6385 6407
##   4  6382 6406
##   5  6385 6407
##   6  6358 6377
##   7  6359 6378
##   8  6356 6375
##   9  6336 6349
##   10 6362 6384
dim(db.over) # We now have 127592 cases overall, which is close to what is expected. The folds column represents the 83rd variable
## [1] 127592     83
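The expected case count can be reasoned out from the fold table above. With method = 'over', ovun.sample keeps the majority class as-is and duplicates minority cases until the two classes are roughly equal, so the total should land near twice the majority count (a back-of-the-envelope check using the readmitted = 0 column printed above):

```r
# Majority-class (readmitted = 0) counts per fold, copied from the table above
maj <- c(6371, 6402, 6385, 6382, 6385, 6358, 6359, 6356, 6336, 6362)
sum(maj)      # 63696 not-readmitted cases, unchanged by oversampling
2 * sum(maj)  # 127392 expected rows in total, close to the observed 127592
```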

Now we will convert the dataframe into a matrix, so that it may be used as an input to the neural network algorithms.

db <- as.matrix(db.over)

Neural Network Model Creation and Evaluation

For each model parameter, we will first try creating a neural network with just one train/test split to get a sense of any issues, as this is less computationally intensive than 10-fold cross validation. These will be noted as “test models” (‘tmodel’ for the object name), as each configuration will be tested before moving to 10-fold cross validation to confirm the optimal parameter.

First we will set a seed that applies to all TensorFlow operations so that results are reproducible.

tensorflow::tf$random$set_seed(31) # If you receive an error indicating that the installation of TensorFlow is not found even though it was installed, just re-run this chunk and it should work

This first test neural network model will have 2 hidden layers of 81 neurons each (the same as the input layer) with ReLU activation functions; the output layer will be a single neuron with the sigmoid function so that outputs fall between 0 and 1. We’ll start with 20 epochs to keep computation manageable. We will start with the binary cross-entropy loss function and the Adam optimizer with their default settings, and leave the training batch size at its default of 32. We will add the test set in as ‘validation data’ so that we can plot its accuracy and loss alongside the training set by epoch, which helps identify overfitting and other issues.

# Create a sequential model (composed of linear layers)
tmodel1 <- keras_model_sequential()

# Create layers
tmodel1 %>%
  layer_dense(units = 81, activation = 'relu', input_shape = c(81)) %>% # First hidden layer with definition of the input layer embedded
  layer_dense(units = 81, activation = 'relu') %>% # Second hidden layer
  layer_dense(units = 1, activation = 'sigmoid') # Output layer

# Compile the model
tmodel1 %>% compile(
  loss = 'binary_crossentropy', # Loss function
  optimizer = 'adam', # Optimizer
  metrics = 'accuracy'
 )

# Fit model to data
training <- tmodel1 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20, # Number of epochs (i.e., number of times the training goes through the whole training dataset)
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]), # The test set will be added in here as the 'validation data' so that we may see its accuracy and loss after each epoch. This will be excluded when running 10-fold cross validation
  verbose = 0 # We don't need to see the loading bars
 )

# Plot the training
plot(training) # The training loss gradually decreases while training accuracy increases. However, the validation (test) loss increases while its accuracy decreases. This indicates that the model is overfitting: it is likely memorizing the training data, particularly since each hidden layer has as many neurons as the input layer. This does not support accurate prediction on the validation set

# Predict classes for test data
tmod1class <- tmodel1 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation measures
tmod1result <- confusionMatrix(as.factor(tmod1class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod1result # Accuracy on the test set is 0.5391, which is very low in comparison to the 0.8820 accuracy of the training set. Also, Recall is very low at 0.2825
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5074 4586
##          1 1297 1806
##                                           
##                Accuracy : 0.5391          
##                  95% CI : (0.5304, 0.5477)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0789          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5820          
##                  Recall : 0.2825          
##                      F1 : 0.3804          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1415          
##    Detection Prevalence : 0.2431          
##       Balanced Accuracy : 0.5395          
##                                           
##        'Positive' Class : 1               
## 
tmod1roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod1class))
auc(tmod1roc) # AUC = 0.5395, which is not much better than randomly guessing the class attribute (AUC = 0.50)
## Area under the curve: 0.5395

Even though the first model did not perform optimally, let’s try the same model parameters using 10-fold cross validation to ensure the code works and view the evaluation measures.

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval1 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model1 <- keras_model_sequential()
  
  # Layers
  model1 %>%
    layer_dense(units = 81, activation = 'relu', input_shape = c(81)) %>%
    layer_dense(units = 81, activation = 'relu') %>%
    layer_dense(units = 1, activation = 'sigmoid')

  # Compile
  model1 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model1 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    verbose = 0
  )

  # Predict classes for test data
  mod1class <- model1 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod1result <- confusionMatrix(as.factor(mod1class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval1$accuracy[f] <- mod1result$overall['Accuracy']
  eval1$precision[f] <- mod1result$byClass['Precision']
  eval1$recall[f] <- mod1result$byClass['Recall']
  eval1$f1[f] <- mod1result$byClass['F1']
  mod1roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod1class))
  eval1$auc[f] <- auc(mod1roc)
}

eval1 # Dataframe of each evaluation metric (accuracy, precision, recall, f1, AUC) for each fold (1-10)
##     accuracy precision    recall        f1       auc
## 1  0.5162579 0.5387074 0.2373279 0.3294961 0.5167176
## 2  0.5462060 0.6123973 0.2554119 0.3604792 0.5466375
## 3  0.5385397 0.5850202 0.2706415 0.3700779 0.5390012
## 4  0.5365968 0.5731707 0.2934749 0.3881891 0.5370539
## 5  0.5382270 0.5775434 0.2906196 0.3866681 0.5386536
## 6  0.5104829 0.5226337 0.2588992 0.3462668 0.5108588
## 7  0.5475387 0.6051282 0.2775165 0.3805224 0.5479421
## 8  0.5459901 0.6015012 0.2765490 0.3788953 0.5463928
## 9  0.5397714 0.5834150 0.2814616 0.3797280 0.5400364
## 10 0.5285580 0.5641903 0.2581454 0.3542182 0.5290255
summary(eval1) # The mean of each evaluation metric is: Accuracy: 0.5348, Precision: 0.5764, Recall: 0.2700, F1: 0.3675, AUC: 0.5352. This is similar to what we found on the test model.
##     accuracy        precision          recall             f1        
##  Min.   :0.5105   Min.   :0.5226   Min.   :0.2373   Min.   :0.3295  
##  1st Qu.:0.5306   1st Qu.:0.5664   1st Qu.:0.2583   1st Qu.:0.3558  
##  Median :0.5384   Median :0.5805   Median :0.2736   Median :0.3745  
##  Mean   :0.5348   Mean   :0.5764   Mean   :0.2700   Mean   :0.3675  
##  3rd Qu.:0.5444   3rd Qu.:0.5974   3rd Qu.:0.2805   3rd Qu.:0.3803  
##  Max.   :0.5475   Max.   :0.6124   Max.   :0.2935   Max.   :0.3882  
##       auc        
##  Min.   :0.5109  
##  1st Qu.:0.5310  
##  Median :0.5388  
##  Mean   :0.5352  
##  3rd Qu.:0.5448  
##  Max.   :0.5479

Now that we have a baseline to improve, and we know the code is working, we can try to improve the model. First, we’re going to try to improve the structure of the model: the number of hidden layers and the number of neurons per hidden layer.

Structure with One Hidden Layer

Let’s choose the number of neurons in the hidden layer using a common rule of thumb: the number of neurons should be about the square root of the number of inputs, plus a constant between 1 and 10. The square root of 81 is 9, so we will try 16 (9 + 7) to see if this may be a viable structure. Then we’ll try a few other values as well.
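The candidate sizes implied by this rule of thumb can be enumerated directly (a quick sketch, nothing model-specific):

```r
# Rule of thumb: hidden neurons = sqrt(number of inputs) + a constant from 1 to 10
sqrt(81)         # 9, the square root of the 81 inputs
sqrt(81) + 1:10  # candidate sizes 10 through 19; 16 (9 + 7) falls in this range
```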

Test Model 2: 1 Hidden Layer with 16 Neurons

# Create model
tmodel2 <- keras_model_sequential()

# Create layers
tmodel2 %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel2 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training2 <- tmodel2 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training2) # Training accuracy is closer to test accuracy this time

# Predict classes for test data
tmod2class <- tmodel2 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod2result <- confusionMatrix(as.factor(tmod2class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod2result # Test set accuracy is 0.6053; precision is 0.6074; recall is much improved at 0.5997; F1 is 0.6035
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3893 2559
##          1 2478 3833
##                                           
##                Accuracy : 0.6053          
##                  95% CI : (0.5968, 0.6138)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2107          
##                                           
##  Mcnemar's Test P-Value : 0.2597          
##                                           
##               Precision : 0.6074          
##                  Recall : 0.5997          
##                      F1 : 0.6035          
##              Prevalence : 0.5008          
##          Detection Rate : 0.3003          
##    Detection Prevalence : 0.4945          
##       Balanced Accuracy : 0.6054          
##                                           
##        'Positive' Class : 1               
## 
tmod2roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod2class))
auc(tmod2roc) # AUC = 0.6054
## Area under the curve: 0.6054

Test Model 3: 1 Hidden Layer with 32 Neurons

# Create model
tmodel3 <- keras_model_sequential()

# Create layers
tmodel3 %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel3 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training3 <- tmodel3 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training3) 

# Predict classes for test data
tmod3class <- tmodel3 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod3result <- confusionMatrix(as.factor(tmod3class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod3result # Test set accuracy is 0.5690, lower than the previous model. Recall (0.4983) is also worse, which reduces the F1 score as well
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4077 3207
##          1 2294 3185
##                                           
##                Accuracy : 0.569           
##                  95% CI : (0.5603, 0.5776)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1382          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5813          
##                  Recall : 0.4983          
##                      F1 : 0.5366          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2495          
##    Detection Prevalence : 0.4293          
##       Balanced Accuracy : 0.5691          
##                                           
##        'Positive' Class : 1               
## 
tmod3roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod3class))
auc(tmod3roc) # AUC = 0.5691
## Area under the curve: 0.5691

Let’s also try setting the number of neurons in the hidden layer to about half the number of inputs.

Test Model 4: 1 Hidden Layer with 41 Neurons

# Create model
tmodel4 <- keras_model_sequential()

# Create layers
tmodel4 %>%
  layer_dense(units = 41, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel4 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training4 <- tmodel4 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training4) # Similar performance to last attempt

# Predict classes for test data
tmod4class <- tmodel4 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod4result <- confusionMatrix(as.factor(tmod4class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod4result # Test set accuracy is 0.5709, with recall at 0.5027
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4074 3179
##          1 2297 3213
##                                           
##                Accuracy : 0.5709          
##                  95% CI : (0.5623, 0.5796)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1421          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5831          
##                  Recall : 0.5027          
##                      F1 : 0.5399          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2517          
##    Detection Prevalence : 0.4317          
##       Balanced Accuracy : 0.5711          
##                                           
##        'Positive' Class : 1               
## 
tmod4roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod4class))
auc(tmod4roc) # AUC = 0.5711
## Area under the curve: 0.5711

Test Model 5: 1 Hidden Layer with 64 Neurons

# Create model
tmodel5 <- keras_model_sequential()

# Create layers
tmodel5 %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel5 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training5 <- tmodel5 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training5) 

# Predict classes for test data
tmod5class <- tmodel5 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod5result <- confusionMatrix(as.factor(tmod5class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod5result # Test set accuracy is 0.5576. Recall is much lower at 0.4316
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4358 3633
##          1 2013 2759
##                                          
##                Accuracy : 0.5576         
##                  95% CI : (0.549, 0.5663)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.1156         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##               Precision : 0.5782         
##                  Recall : 0.4316         
##                      F1 : 0.4943         
##              Prevalence : 0.5008         
##          Detection Rate : 0.2162         
##    Detection Prevalence : 0.3739         
##       Balanced Accuracy : 0.5578         
##                                          
##        'Positive' Class : 1              
## 
tmod5roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod5class))
auc(tmod5roc) # AUC = 0.5578
## Area under the curve: 0.5578

Test Model 6: 1 Hidden Layer with 128 Neurons

# Create model
tmodel6 <- keras_model_sequential()

# Create layers
tmodel6 %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel6 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training6 <- tmodel6 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training6) # The training accuracy is again inflated, around 0.8, similar to the 81/81 test model, which points to overfitting

# Predict classes for test data
tmod6class <- tmodel6 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod6result <- confusionMatrix(as.factor(tmod6class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod6result # Test set accuracy is 0.5448. Recall is low at 0.3712
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4580 4019
##          1 1791 2373
##                                           
##                Accuracy : 0.5448          
##                  95% CI : (0.5361, 0.5534)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0901          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5699          
##                  Recall : 0.3712          
##                      F1 : 0.4496          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1859          
##    Detection Prevalence : 0.3263          
##       Balanced Accuracy : 0.5451          
##                                           
##        'Positive' Class : 1               
## 
tmod6roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod6class))
auc(tmod6roc) # AUC = 0.5451
## Area under the curve: 0.5451

So far, with 1 hidden layer, the best-performing model on accuracy and AUC is Test Model 2 with 16 neurons. We’ll retrain that configuration using 10-fold cross validation to obtain average evaluation metrics for comparison against configurations with other numbers of hidden layers.

Cross Validation for Model 2: 1 Hidden Layer with 16 Neurons

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval2 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model2 <- keras_model_sequential()
  
  # Layers
  model2 %>%
    layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
    layer_dense(units = 1, activation = 'sigmoid')

  # Compile
  model2 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model2 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    verbose = 0
  )

  # Predict classes for test data
  mod2class <- model2 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod2result <- confusionMatrix(as.factor(mod2class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval2$accuracy[f] <- mod2result$overall['Accuracy']
  eval2$precision[f] <- mod2result$byClass['Precision']
  eval2$recall[f] <- mod2result$byClass['Recall']
  eval2$f1[f] <- mod2result$byClass['F1']
  mod2roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod2class))
  eval2$auc[f] <- auc(mod2roc)
}

eval2
##     accuracy precision    recall        f1       auc
## 1  0.5898300 0.6032851 0.5286295 0.5634954 0.5899308
## 2  0.5783358 0.5795669 0.5751441 0.5773470 0.5783405
## 3  0.5862258 0.6178587 0.4557515 0.5245666 0.5864505
## 4  0.5887551 0.6084736 0.5021855 0.5502437 0.5889179
## 5  0.5619137 0.5912293 0.4061183 0.4814952 0.5621821
## 6  0.5583824 0.5723343 0.4671476 0.5144189 0.5585187
## 7  0.5948811 0.6014662 0.5660082 0.5831987 0.5949242
## 8  0.5916267 0.5957655 0.5738039 0.5845785 0.5916534
## 9  0.5713047 0.5857331 0.4901559 0.5336992 0.5713879
## 10 0.5773576 0.6019219 0.4611529 0.5222173 0.5775585
summary(eval2) # The mean of each evaluation metric is: Accuracy: 0.5799, Precision: 0.5958, Recall: 0.5026, F1: 0.5435, AUC: 0.5800. This is an improvement over our original baseline cross validation on the 81/81 network
##     accuracy        precision          recall             f1        
##  Min.   :0.5584   Min.   :0.5723   Min.   :0.4061   Min.   :0.4815  
##  1st Qu.:0.5728   1st Qu.:0.5871   1st Qu.:0.4627   1st Qu.:0.5228  
##  Median :0.5823   Median :0.5986   Median :0.4962   Median :0.5420  
##  Mean   :0.5799   Mean   :0.5958   Mean   :0.5026   Mean   :0.5435  
##  3rd Qu.:0.5896   3rd Qu.:0.6029   3rd Qu.:0.5567   3rd Qu.:0.5739  
##  Max.   :0.5949   Max.   :0.6179   Max.   :0.5751   Max.   :0.5846  
##       auc        
##  Min.   :0.5585  
##  1st Qu.:0.5729  
##  Median :0.5824  
##  Mean   :0.5800  
##  3rd Qu.:0.5897  
##  Max.   :0.5949

Let’s try some structures with 2 hidden layers.

Structure with 2 Hidden Layers

We’ll try some structures with 2 hidden layers, using for both layers the neuron counts we tried when testing with 1 hidden layer. We will also test a few structures with larger and then smaller neuron counts for the 2 hidden layers. Of note, for any network with more than one hidden layer, we will express the number of neurons per hidden layer as x/y/…, with x being the number of neurons in the first hidden layer, y being the number in the second, and so on.

Test Model 7: 2 Hidden Layers; 16/16

# Create model
tmodel7 <- keras_model_sequential()

# Create layers
tmodel7 %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>% 
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel7 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training7 <- tmodel7 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training7) # Still overfitting: validation loss and accuracy move in the opposite directions from what we want

# Predict classes for test data
tmod7class <- tmodel7 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod7result <- confusionMatrix(as.factor(tmod7class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod7result # Test set accuracy is 0.5566; recall is 0.5114
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3835 3123
##          1 2536 3269
##                                           
##                Accuracy : 0.5566          
##                  95% CI : (0.5479, 0.5653)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1133          
##                                           
##  Mcnemar's Test P-Value : 6.71e-15        
##                                           
##               Precision : 0.5631          
##                  Recall : 0.5114          
##                      F1 : 0.5360          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2561          
##    Detection Prevalence : 0.4548          
##       Balanced Accuracy : 0.5567          
##                                           
##        'Positive' Class : 1               
## 
tmod7roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod7class))
auc(tmod7roc) # AUC = 0.5567
## Area under the curve: 0.5567

Test Model 8: 2 Hidden Layers; 32/32

# Create model
tmodel8 <- keras_model_sequential()

# Create layers
tmodel8 %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 32, activation = 'relu') %>% 
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel8 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training8 <- tmodel8 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training8)

# Predict classes for test data
tmod8class <- tmodel8 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod8result <- confusionMatrix(as.factor(tmod8class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod8result # Test set accuracy is 0.5597. Recall is much lower at 0.4182
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4471 3719
##          1 1900 2673
##                                           
##                Accuracy : 0.5597          
##                  95% CI : (0.5511, 0.5684)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1199          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5845          
##                  Recall : 0.4182          
##                      F1 : 0.4876          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2094          
##    Detection Prevalence : 0.3583          
##       Balanced Accuracy : 0.5600          
##                                           
##        'Positive' Class : 1               
## 
tmod8roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod8class))
auc(tmod8roc) # AUC = 0.56
## Area under the curve: 0.56

Test Model 9: 2 Hidden Layers; 41/41

# Create model
tmodel9 <- keras_model_sequential()

# Create layers
tmodel9 %>%
  layer_dense(units = 41, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 41, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel9 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training9 <- tmodel9 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training9)

# Predict classes for test data
tmod9class <- tmodel9 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod9result <- confusionMatrix(as.factor(tmod9class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod9result # Test set accuracy is 0.553. Recall very low at 0.3869
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4614 3919
##          1 1757 2473
##                                           
##                Accuracy : 0.5553          
##                  95% CI : (0.5466, 0.5639)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.111           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5846          
##                  Recall : 0.3869          
##                      F1 : 0.4656          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1938          
##    Detection Prevalence : 0.3314          
##       Balanced Accuracy : 0.5556          
##                                           
##        'Positive' Class : 1               
## 
tmod9roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod9class))
auc(tmod9roc) # AUC = 0.5556
## Area under the curve: 0.5556

Test Model 10: 2 Hidden Layers; 64/64

# Create model
tmodel10 <- keras_model_sequential()

# Create layers
tmodel10 %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel10 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training10 <- tmodel10 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training10) # Training accuracy is much higher than validation accuracy (strong overfitting)

# Predict classes for test data
tmod10class <- tmodel10 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod10result <- confusionMatrix(as.factor(tmod10class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod10result # Test set accuracy is 0.5381. Recall very low at 0.3263
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4782 4306
##          1 1589 2086
##                                           
##                Accuracy : 0.5381          
##                  95% CI : (0.5294, 0.5468)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0769          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5676          
##                  Recall : 0.3263          
##                      F1 : 0.4144          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1634          
##    Detection Prevalence : 0.2879          
##       Balanced Accuracy : 0.5385          
##                                           
##        'Positive' Class : 1               
## 
tmod10roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod10class))
auc(tmod10roc) # AUC = 0.5385
## Area under the curve: 0.5385

Test Model 11: 2 Hidden Layers; 128/128

# Create model
tmodel11 <- keras_model_sequential()

# Create layers
tmodel11 %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel11 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training11 <- tmodel11 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training11)

# Predict classes for test data
tmod11class <- tmodel11 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod11result <- confusionMatrix(as.factor(tmod11class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod11result # Test set accuracy is 0.5261. Recall extremely low at 0.1600
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5692 5369
##          1  679 1023
##                                           
##                Accuracy : 0.5261          
##                  95% CI : (0.5174, 0.5348)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 5.627e-09       
##                                           
##                   Kappa : 0.0534          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.60106         
##                  Recall : 0.16004         
##                      F1 : 0.25278         
##              Prevalence : 0.50082         
##          Detection Rate : 0.08015         
##    Detection Prevalence : 0.13335         
##       Balanced Accuracy : 0.52673         
##                                           
##        'Positive' Class : 1               
## 
tmod11roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod11class))
auc(tmod11roc) # AUC = 0.5267
## Area under the curve: 0.5267

Test Model 12: 2 Hidden Layers; 32/16

# Create model
tmodel12 <- keras_model_sequential()

# Create layers
tmodel12 %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel12 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training12 <- tmodel12 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training12)

# Predict classes for test data
tmod12class <- tmodel12 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod12result <- confusionMatrix(as.factor(tmod12class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod12result # Test set accuracy is 0.5578, with recall at 0.4449
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4275 3548
##          1 2096 2844
##                                           
##                Accuracy : 0.5578          
##                  95% CI : (0.5491, 0.5664)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1159          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5757          
##                  Recall : 0.4449          
##                      F1 : 0.5019          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2228          
##    Detection Prevalence : 0.3871          
##       Balanced Accuracy : 0.5580          
##                                           
##        'Positive' Class : 1               
## 
tmod12roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod12class))
auc(tmod12roc) # AUC = 0.558
## Area under the curve: 0.558

Test Model 13: 2 Hidden Layers; 64/32

# Create model
tmodel13 <- keras_model_sequential()

# Create layers
tmodel13 %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel13 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training13 <- tmodel13 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training13)

# Predict classes for test data
tmod13class <- tmodel13 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod13result <- confusionMatrix(as.factor(tmod13class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod13result # Test set accuracy is 0.5548 with low recall at 0.3479
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4857 4168
##          1 1514 2224
##                                           
##                Accuracy : 0.5548          
##                  95% CI : (0.5461, 0.5635)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1102          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5950          
##                  Recall : 0.3479          
##                      F1 : 0.4391          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1743          
##    Detection Prevalence : 0.2929          
##       Balanced Accuracy : 0.5551          
##                                           
##        'Positive' Class : 1               
## 
tmod13roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod13class))
auc(tmod13roc) # AUC = 0.5551
## Area under the curve: 0.5551

Test Model 14: 2 Hidden Layers; 128/64

# Create model
tmodel14 <- keras_model_sequential()

# Create layers
tmodel14 %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel14 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training14 <- tmodel14 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training14)

# Predict classes for test data
tmod14class <- tmodel14 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod14result <- confusionMatrix(as.factor(tmod14class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod14result # Test set accuracy is 0.5301; recall = 0.2071 (very low)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5442 5068
##          1  929 1324
##                                           
##                Accuracy : 0.5301          
##                  95% CI : (0.5214, 0.5388)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 1.867e-11       
##                                           
##                   Kappa : 0.0613          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5877          
##                  Recall : 0.2071          
##                      F1 : 0.3063          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1037          
##    Detection Prevalence : 0.1765          
##       Balanced Accuracy : 0.5307          
##                                           
##        'Positive' Class : 1               
## 
tmod14roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod14class))
auc(tmod14roc) # AUC = 0.5307
## Area under the curve: 0.5307

Of the structures with 2 hidden layers, the 32/16 configuration (Test Model 12) performed best in accuracy (0.5578) and AUC (0.558), narrowly beating Test Model 7 (16/16), which had an accuracy of 0.5566 and an AUC of 0.5567. However, Test Model 7 had a much higher recall than Test Model 12 (0.5114 vs. 0.4449). Because we want to prioritize recall (i.e., we want the model to catch a high proportion of the actual positives), we’ll retrain Test Model 7 with 10-fold cross-validation to obtain average evaluation metrics that we can compare against configurations with other numbers of hidden layers.
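
Since recall is the priority, it is also worth noting that the 0.5 cutoff implicit in `predict_classes()` is itself a tunable knob: lowering it turns some predicted negatives into positives, which can only raise recall (at the cost of precision). A sketch on stand-in probabilities; with the real model, the probabilities would come from `predict()` on the test fold:

```r
set.seed(42)
probs <- runif(1000)              # stand-in predicted probabilities
truth <- rbinom(1000, 1, probs)   # stand-in labels correlated with probs

# Recall of the positive class at a given probability cutoff
recall_at <- function(cutoff) {
  pred <- as.integer(probs > cutoff)
  sum(pred == 1 & truth == 1) / sum(truth == 1)
}

recall_at(0.3) >= recall_at(0.5)  # TRUE: a lower cutoff never lowers recall
```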

Cross Validation for Model 7: 2 Hidden Layers; 16/16

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval7 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model7 <- keras_model_sequential()
  
  # Layers
  model7 %>%
    layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model7 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model7 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    verbose = 0
  )

  # Predict classes for test data
  mod7class <- model7 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod7result <- confusionMatrix(as.factor(mod7class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval7$accuracy[f] <- mod7result$overall['Accuracy']
  eval7$precision[f] <- mod7result$byClass['Precision']
  eval7$recall[f] <- mod7result$byClass['Recall']
  eval7$f1[f] <- mod7result$byClass['F1']
  mod7roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod7class))
  eval7$auc[f] <- auc(mod7roc)
}

eval7
##     accuracy precision    recall        f1       auc
## 1  0.5763535 0.5765584 0.5802566 0.5784016 0.5763471
## 2  0.5739687 0.5737831 0.5801277 0.5769380 0.5739595
## 3  0.5942777 0.6086413 0.5320743 0.5677881 0.5943848
## 4  0.5864091 0.5947735 0.5471433 0.5699650 0.5864830
## 5  0.5710600 0.5884615 0.4776026 0.5272680 0.5712210
## 6  0.5579898 0.5606552 0.5421044 0.5512238 0.5580135
## 7  0.5746251 0.5880088 0.5028222 0.5420892 0.5747324
## 8  0.5669625 0.5716900 0.5391373 0.5549366 0.5670041
## 9  0.5675995 0.5729730 0.5342574 0.5529383 0.5676337
## 10 0.5756316 0.5853018 0.5239662 0.5529383 0.5757209
summary(eval7) # The mean of each evaluation metric is: Accuracy: 0.5745, Precision: 0.5821, Recall: 0.5359, f1: 0.5574, AUC: 0.5746
##     accuracy        precision          recall             f1        
##  Min.   :0.5580   Min.   :0.5607   Min.   :0.4776   Min.   :0.5273  
##  1st Qu.:0.5685   1st Qu.:0.5732   1st Qu.:0.5260   1st Qu.:0.5517  
##  Median :0.5743   Median :0.5809   Median :0.5367   Median :0.5539  
##  Mean   :0.5745   Mean   :0.5821   Mean   :0.5359   Mean   :0.5574  
##  3rd Qu.:0.5762   3rd Qu.:0.5883   3rd Qu.:0.5459   3rd Qu.:0.5694  
##  Max.   :0.5943   Max.   :0.6086   Max.   :0.5803   Max.   :0.5784  
##       auc        
##  Min.   :0.5580  
##  1st Qu.:0.5685  
##  Median :0.5743  
##  Mean   :0.5746  
##  3rd Qu.:0.5762  
##  Max.   :0.5944
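
The means quoted above are simply the column means of `eval7`. As a hand-check, two of them can be reproduced from the fold-level values printed in the table (copied here so the snippet runs standalone):

```r
# Fold-level results copied from the eval7 printout above
eval7 <- data.frame(
  accuracy = c(0.5763535, 0.5739687, 0.5942777, 0.5864091, 0.5710600,
               0.5579898, 0.5746251, 0.5669625, 0.5675995, 0.5756316),
  recall   = c(0.5802566, 0.5801277, 0.5320743, 0.5471433, 0.4776026,
               0.5421044, 0.5028222, 0.5391373, 0.5342574, 0.5239662))

round(colMeans(eval7), 4)  # accuracy 0.5745, recall 0.5359
```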

Let’s try some structures with 3 hidden layers.

Structure with 3 Hidden Layers

Now we will try a few configurations with 3 hidden layers. First we’ll reuse the unit counts from the 2-hidden-layer experiments, keeping the same number of neurons in every layer. Then we will try a low/high/low pattern: fewer neurons, then more, then fewer.
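
For orientation, the eight configurations tried below (Test Models 15 through 22) can be enumerated up front; this is only a bookkeeping sketch, not code the models depend on:

```r
# Uniform widths, mirroring the 2-hidden-layer experiments
uniform <- lapply(c(16, 32, 41, 64, 128), function(u) rep(u, 3))

# Low/high/low patterns
lhl <- list(c(16, 32, 16), c(32, 64, 32), c(64, 128, 64))

configs <- c(uniform, lhl)
length(configs)  # 8 candidate structures
```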

Test Model 15: 3 Hidden Layers; 16/16/16

# Create model
tmodel15 <- keras_model_sequential()

# Create layers
tmodel15 %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel15 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training15 <- tmodel15 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training15) # Not overfitting as much as some of the previous iterations

# Predict classes for test data
tmod15class <- tmodel15 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod15result <- confusionMatrix(as.factor(tmod15class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod15result # Test set accuracy is 0.5724. Recall stayed at a reasonable value: 0.5383
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3865 2951
##          1 2506 3441
##                                          
##                Accuracy : 0.5724         
##                  95% CI : (0.5638, 0.581)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.145          
##                                          
##  Mcnemar's Test P-Value : 1.85e-09       
##                                          
##               Precision : 0.5786         
##                  Recall : 0.5383         
##                      F1 : 0.5577         
##              Prevalence : 0.5008         
##          Detection Rate : 0.2696         
##    Detection Prevalence : 0.4660         
##       Balanced Accuracy : 0.5725         
##                                          
##        'Positive' Class : 1              
## 
tmod15roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod15class))
auc(tmod15roc) # AUC = 0.5725
## Area under the curve: 0.5725
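
As a sanity check on how `confusionMatrix()` derives its measures, the precision, recall, and F1 above follow directly from the cells of the matrix (positive class = 1, so TP = 3441, FP = 2506, FN = 2951):

```r
TP <- 3441; FP <- 2506; FN <- 2951  # cells from the confusion matrix above

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 4)
# precision 0.5786, recall 0.5383, f1 0.5577 -- matching the printout
```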

Test Model 16: 3 Hidden Layers; 32/32/32

# Create model
tmodel16 <- keras_model_sequential()

# Create layers
tmodel16 %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel16 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training16 <- tmodel16 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training16)

# Predict classes for test data
tmod16class <- tmodel16 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod16result <- confusionMatrix(as.factor(tmod16class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod16result # Test set accuracy is 0.5795; recall is a bit lower at 0.4745
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4363 3359
##          1 2008 3033
##                                           
##                Accuracy : 0.5795          
##                  95% CI : (0.5709, 0.5881)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1593          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6017          
##                  Recall : 0.4745          
##                      F1 : 0.5306          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2376          
##    Detection Prevalence : 0.3950          
##       Balanced Accuracy : 0.5797          
##                                           
##        'Positive' Class : 1               
## 
tmod16roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod16class))
auc(tmod16roc) # AUC = 0.5797
## Area under the curve: 0.5797

Test Model 17: 3 Hidden Layers; 41/41/41

# Create model
tmodel17 <- keras_model_sequential()

# Create layers
tmodel17 %>%
  layer_dense(units = 41, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 41, activation = 'relu') %>%
  layer_dense(units = 41, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel17 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training17 <- tmodel17 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training17)

# Predict classes for test data
tmod17class <- tmodel17 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod17result <- confusionMatrix(as.factor(tmod17class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod17result # Test set accuracy is 0.5488. Recall low at 0.3598
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4704 4092
##          1 1667 2300
##                                           
##                Accuracy : 0.5488          
##                  95% CI : (0.5401, 0.5574)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0981          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5798          
##                  Recall : 0.3598          
##                      F1 : 0.4441          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1802          
##    Detection Prevalence : 0.3108          
##       Balanced Accuracy : 0.5491          
##                                           
##        'Positive' Class : 1               
## 
tmod17roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod17class))
auc(tmod17roc) # AUC = 0.5491
## Area under the curve: 0.5491

Test Model 18: 3 Hidden Layers; 64/64/64

# Create model
tmodel18 <- keras_model_sequential()

# Create layers
tmodel18 %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel18 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training18 <- tmodel18 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training18)

# Predict classes for test data
tmod18class <- tmodel18 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod18result <- confusionMatrix(as.factor(tmod18class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod18result # Test set accuracy is 0.5427. Recall very low at 0.2692
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5205 4671
##          1 1166 1721
##                                          
##                Accuracy : 0.5427         
##                  95% CI : (0.534, 0.5513)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.0861         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##               Precision : 0.5961         
##                  Recall : 0.2692         
##                      F1 : 0.3709         
##              Prevalence : 0.5008         
##          Detection Rate : 0.1348         
##    Detection Prevalence : 0.2262         
##       Balanced Accuracy : 0.5431         
##                                          
##        'Positive' Class : 1              
## 
tmod18roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod18class))
auc(tmod18roc) # AUC = 0.5431
## Area under the curve: 0.5431

Test Model 19: 3 Hidden Layers; 128/128/128

# Create model
tmodel19 <- keras_model_sequential()

# Create layers
tmodel19 %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel19 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training19 <- tmodel19 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training19)

# Predict classes for test data
tmod19class <- tmodel19 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod19result <- confusionMatrix(as.factor(tmod19class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod19result # Test set accuracy is 0.5243. Recall extremely low at 0.1554
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5699 5399
##          1  672  993
##                                          
##                Accuracy : 0.5243         
##                  95% CI : (0.5156, 0.533)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : 5.689e-08      
##                                          
##                   Kappa : 0.0498         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##               Precision : 0.5964         
##                  Recall : 0.1554         
##                      F1 : 0.2465         
##              Prevalence : 0.5008         
##          Detection Rate : 0.0778         
##    Detection Prevalence : 0.1305         
##       Balanced Accuracy : 0.5249         
##                                          
##        'Positive' Class : 1              
## 
tmod19roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod19class))
auc(tmod19roc) # AUC = 0.5249
## Area under the curve: 0.5249

Test Model 20: 3 Hidden Layers; 16/32/16

# Create model
tmodel20 <- keras_model_sequential()

# Create layers
tmodel20 %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel20 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training20 <- tmodel20 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training20)

# Predict classes for test data
tmod20class <- tmodel20 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod20result <- confusionMatrix(as.factor(tmod20class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod20result # Test set accuracy is 0.5754; recall is higher at 0.5753
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3667 2715
##          1 2704 3677
##                                          
##                Accuracy : 0.5754         
##                  95% CI : (0.5668, 0.584)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.1508         
##                                          
##  Mcnemar's Test P-Value : 0.8919         
##                                          
##               Precision : 0.5762         
##                  Recall : 0.5753         
##                      F1 : 0.5757         
##              Prevalence : 0.5008         
##          Detection Rate : 0.2881         
##    Detection Prevalence : 0.5000         
##       Balanced Accuracy : 0.5754         
##                                          
##        'Positive' Class : 1              
## 
tmod20roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod20class))
auc(tmod20roc) # AUC = 0.5754
## Area under the curve: 0.5754
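
One caveat worth making explicit: `roc()` here is given hard 0/1 class labels rather than predicted probabilities, so the ROC curve has a single operating point and the AUC collapses to the balanced accuracy (the mean of sensitivity and specificity). That is why AUC and Balanced Accuracy agree in every model above. Checked against Model 20's matrix:

```r
TP <- 3677; FN <- 2715; TN <- 3667; FP <- 2704  # cells from the matrix above

sens <- TP / (TP + FN)  # recall of class 1
spec <- TN / (TN + FP)  # recall of class 0

round((sens + spec) / 2, 4)  # 0.5754, equal to the reported AUC
```

Feeding the raw predicted probabilities into `roc()` instead would yield a proper threshold-independent AUC.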

Test Model 21: 3 Hidden Layers; 32/64/32

# Create model
tmodel21 <- keras_model_sequential()

# Create layers
tmodel21 %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel21 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training21 <- tmodel21 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training21)

# Predict classes for test data
tmod21class <- tmodel21 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod21result <- confusionMatrix(as.factor(tmod21class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod21result # Test set accuracy is 0.54. Recall low at 0.3151
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4878 4378
##          1 1493 2014
##                                           
##                Accuracy : 0.54            
##                  95% CI : (0.5313, 0.5487)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0807          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5743          
##                  Recall : 0.3151          
##                      F1 : 0.4069          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1578          
##    Detection Prevalence : 0.2748          
##       Balanced Accuracy : 0.5404          
##                                           
##        'Positive' Class : 1               
## 
tmod21roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod21class))
auc(tmod21roc) # AUC = 0.5404
## Area under the curve: 0.5404

Test Model 22: 3 Hidden Layers; 64/128/64

# Create model
tmodel22 <- keras_model_sequential()

# Create layers
tmodel22 %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel22 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training22 <- tmodel22 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  verbose = 0
 )

# Plot training
plot(training22)

# Predict classes for test data
tmod22class <- tmodel22 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod22result <- confusionMatrix(as.factor(tmod22class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod22result # Test set accuracy is 0.531. Recall low at 0.2004
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5496 5111
##          1  875 1281
##                                           
##                Accuracy : 0.531           
##                  95% CI : (0.5223, 0.5397)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 4.909e-12       
##                                           
##                   Kappa : 0.063           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5942          
##                  Recall : 0.2004          
##                      F1 : 0.2997          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1004          
##    Detection Prevalence : 0.1689          
##       Balanced Accuracy : 0.5315          
##                                           
##        'Positive' Class : 1               
## 
tmod22roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod22class))
auc(tmod22roc) # AUC = 0.5315
## Area under the curve: 0.5315

Of the structures with 3 hidden layers, the 16/32/16 configuration (Test Model 20) performed best in accuracy, AUC, and recall. We will test this model with 10-fold cross-validation to compare it to the other configurations.

Cross Validation for Model 20: 3 Hidden Layers; 16/32/16

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval20 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model20 <- keras_model_sequential()
  
  # Layers
  model20 %>%
    layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
    layer_dense(units = 32, activation = 'relu') %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model20 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model20 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    verbose = 0
  )

  # Predict classes for test data
  mod20class <- model20 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod20result <- confusionMatrix(as.factor(mod20class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval20$accuracy[f] <- mod20result$overall['Accuracy']
  eval20$precision[f] <- mod20result$byClass['Precision']
  eval20$recall[f] <- mod20result$byClass['Recall']
  eval20$f1[f] <- mod20result$byClass['F1']
  mod20roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod20class))
  eval20$auc[f] <- auc(mod20roc)
}

eval20
##     accuracy precision    recall        f1       auc
## 1  0.5864609 0.5966343 0.5380163 0.5658111 0.5865407
## 2  0.5503392 0.5543749 0.5200125 0.5366442 0.5503842
## 3  0.5640244 0.5781839 0.4790073 0.5239437 0.5641709
## 4  0.5584923 0.5774246 0.4423978 0.5009722 0.5587106
## 5  0.5663696 0.5894715 0.4421726 0.5053063 0.5665836
## 6  0.5581468 0.5637322 0.5201505 0.5410652 0.5582036
## 7  0.5750962 0.5735160 0.5907808 0.5820204 0.5750727
## 8  0.5739533 0.5794486 0.5440000 0.5611650 0.5739981
## 9  0.5712259 0.5774864 0.5340999 0.5549464 0.5712639
## 10 0.5696689 0.5838776 0.4901316 0.5329132 0.5698064
summary(eval20) # The mean of each evaluation metric is: Accuracy: 0.5674, Precision: 0.5774, Recall: 0.5101, f1: 0.5405, AUC: 0.5675
##     accuracy        precision          recall             f1        
##  Min.   :0.5503   Min.   :0.5544   Min.   :0.4422   Min.   :0.5010  
##  1st Qu.:0.5599   1st Qu.:0.5745   1st Qu.:0.4818   1st Qu.:0.5262  
##  Median :0.5680   Median :0.5778   Median :0.5201   Median :0.5389  
##  Mean   :0.5674   Mean   :0.5774   Mean   :0.5101   Mean   :0.5405  
##  3rd Qu.:0.5733   3rd Qu.:0.5828   3rd Qu.:0.5370   3rd Qu.:0.5596  
##  Max.   :0.5865   Max.   :0.5966   Max.   :0.5908   Max.   :0.5820  
##       auc        
##  Min.   :0.5504  
##  1st Qu.:0.5601  
##  Median :0.5682  
##  Mean   :0.5675  
##  3rd Qu.:0.5733  
##  Max.   :0.5865

Let’s compare the three model configurations that had 10-fold cross validation run on them.

# 1 hidden layer: 16
summary(eval2)[4,] # Had the highest mean accuracy, precision, and AUC
##           accuracy          precision             recall                 f1 
## "Mean   :0.5799  " "Mean   :0.5958  " "Mean   :0.5026  " "Mean   :0.5435  " 
##                auc 
## "Mean   :0.5800  "
# 2 hidden layers: 16/16
summary(eval7)[4,] # Had the highest recall and F1
##           accuracy          precision             recall                 f1 
## "Mean   :0.5745  " "Mean   :0.5821  " "Mean   :0.5359  " "Mean   :0.5574  " 
##                auc 
## "Mean   :0.5746  "
# 3 hidden layers: 16/32/16
summary(eval20)[4,]
##           accuracy          precision             recall                 f1 
## "Mean   :0.5674  " "Mean   :0.5774  " "Mean   :0.5101  " "Mean   :0.5405  " 
##                auc 
## "Mean   :0.5675  "

Model 2 (one hidden layer of 16 neurons) had the highest mean accuracy (0.5799), precision (0.5958), and AUC (0.5800) of the three models. Model 7 (16/16) had the highest mean recall (0.5359) and F1 score (0.5574), along with the second-highest accuracy (0.5745) and AUC (0.5746), both within 0.01 of Model 2's, and the second-highest precision (0.5821). Since we want to prioritize recall, i.e., correctly identifying a high proportion of the truly positive cases, and Model 7's accuracy and AUC are close to the best performer's, we will move forward with the Model 7 (16/16) configuration.

Batch size

The batch size is the number of training samples the network processes before updating the weights and biases. Larger batches yield smoother, more stable gradient estimates and fewer updates per epoch, while the extra noise of smaller batches can sometimes help the optimizer escape poor local minima of the loss function. We're going to try increasing the batch size used for training from the default (32) to 64 to see if there is a change in performance (the batch size is set in the 'Fit model' step).
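For intuition, the number of weight updates per epoch is ceiling(n / batch_size), so doubling the batch size roughly halves the updates per epoch (illustrative only; n below is a hypothetical training-set size, not this dataset's):

n <- 60000       # hypothetical number of training rows
ceiling(n / 32)  # 1875 updates per epoch at the default batch size
ceiling(n / 64)  # 938 updates per epoch at batch size 64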

Test Model 23: Batch Size 64

# Create model
tmodel23 <- keras_model_sequential()

# Create layers
tmodel23 %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel23 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training23 <- tmodel23 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 64, # Updated
  verbose = 0
 )

# Plot training
plot(training23) # Doesn't appear much different than previous attempts

# Predict classes for test data
tmod23class <- tmodel23 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128) # Batch size here just relates to computation and doesn't affect outcome

# Evaluation
tmod23result <- confusionMatrix(as.factor(tmod23class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod23result # Test set accuracy is 0.58 - very slightly higher than before. Precision is 0.5909 (higher), Recall is 0.5241 (slightly reduced), with F1 0.5555. Precision is better but recall is worse than for a batch size of 32
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4052 3042
##          1 2319 3350
##                                           
##                Accuracy : 0.58            
##                  95% CI : (0.5713, 0.5885)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1601          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.5909          
##                  Recall : 0.5241          
##                      F1 : 0.5555          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2625          
##    Detection Prevalence : 0.4442          
##       Balanced Accuracy : 0.5800          
##                                           
##        'Positive' Class : 1               
## 
tmod23roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod23class))
auc(tmod23roc) # AUC = 0.58 - very slightly higher than before
## Area under the curve: 0.58

Let’s try with a batch size of 128 as well to see if it makes a difference.

Test Model 24: Batch Size 128

# Create model
tmodel24 <- keras_model_sequential()

# Create layers
tmodel24 %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel24 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training24 <- tmodel24 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128, # Updated
  verbose = 0
 )

# Plot training
plot(training24) 

# Predict classes for test data
tmod24class <- tmodel24 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod24result <- confusionMatrix(as.factor(tmod24class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod24result # Test set accuracy is 0.5743 - similar to batch size of 32. Precision is 0.5797, Recall is 0.5458 (a bit higher), and F1 is 0.5622 (higher)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3841 2903
##          1 2530 3489
##                                           
##                Accuracy : 0.5743          
##                  95% CI : (0.5657, 0.5829)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1487          
##                                           
##  Mcnemar's Test P-Value : 4.491e-07       
##                                           
##               Precision : 0.5797          
##                  Recall : 0.5458          
##                      F1 : 0.5622          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2734          
##    Detection Prevalence : 0.4716          
##       Balanced Accuracy : 0.5744          
##                                           
##        'Positive' Class : 1               
## 
tmod24roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod24class))
auc(tmod24roc) # AUC = 0.5744 - similar to batch size of 32
## Area under the curve: 0.5744

Test Model 24, with a batch size of 128, had accuracy, precision, and AUC similar to Test Model 7 (batch size 32), with a higher recall and F1 score. Let’s run Test Model 24 with cross validation to see how it performs.

Cross Validation for Model 24: Batch Size 128

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval24 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model24 <- keras_model_sequential()
  
  # Layers
  model24 %>%
    layer_dense(units = 16, activation = 'relu', input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model24 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model24 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod24class <- model24 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod24result <- confusionMatrix(as.factor(mod24class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval24$accuracy[f] <- mod24result$overall['Accuracy']
  eval24$precision[f] <- mod24result$byClass['Precision']
  eval24$recall[f] <- mod24result$byClass['Recall']
  eval24$f1[f] <- mod24result$byClass['F1']
  mod24roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod24class))
  eval24$auc[f] <- auc(mod24roc)
}

eval24
##     accuracy precision    recall        f1       auc
## 1  0.5904568 0.5946693 0.5724343 0.5833400 0.5904865
## 2  0.5699914 0.5696300 0.5777916 0.5736818 0.5699798
## 3  0.5953721 0.6065063 0.5470579 0.5752503 0.5954553
## 4  0.5792931 0.5923326 0.5137371 0.5502424 0.5794164
## 5  0.5748124 0.5923312 0.4846262 0.5330930 0.5749678
## 6  0.5785630 0.5915518 0.5116826 0.5487261 0.5786629
## 7  0.5787077 0.5898438 0.5208529 0.5532057 0.5787941
## 8  0.5781164 0.5871225 0.5306667 0.5574689 0.5781873
## 9  0.5680725 0.5767737 0.5147267 0.5439867 0.5681273
## 10 0.5699827 0.5700543 0.5755013 0.5727648 0.5699732
summary(eval24) # The mean of each evaluation metric is: Accuracy: 0.5783, Precision: 0.5871, Recall: 0.5349, f1: 0.5592, AUC: 0.5784
##     accuracy        precision          recall             f1        
##  Min.   :0.5681   Min.   :0.5696   Min.   :0.4846   Min.   :0.5331  
##  1st Qu.:0.5712   1st Qu.:0.5794   1st Qu.:0.5140   1st Qu.:0.5491  
##  Median :0.5783   Median :0.5907   Median :0.5258   Median :0.5553  
##  Mean   :0.5783   Mean   :0.5871   Mean   :0.5349   Mean   :0.5592  
##  3rd Qu.:0.5791   3rd Qu.:0.5923   3rd Qu.:0.5661   3rd Qu.:0.5735  
##  Max.   :0.5954   Max.   :0.6065   Max.   :0.5778   Max.   :0.5833  
##       auc        
##  Min.   :0.5681  
##  1st Qu.:0.5712  
##  Median :0.5784  
##  Mean   :0.5784  
##  3rd Qu.:0.5793  
##  Max.   :0.5955

Let’s compare the two model configurations (batch size of 32 and 128) that had 10-fold cross validation run on them.

# Batch size of 32
summary(eval7)[4,] # Had the highest mean recall, though only by 0.001
##           accuracy          precision             recall                 f1 
## "Mean   :0.5745  " "Mean   :0.5821  " "Mean   :0.5359  " "Mean   :0.5574  " 
##                auc 
## "Mean   :0.5746  "
# Batch size of 128
summary(eval24)[4,] # Had the highest mean accuracy, precision, f1 and AUC
##           accuracy          precision             recall                 f1 
## "Mean   :0.5783  " "Mean   :0.5871  " "Mean   :0.5349  " "Mean   :0.5592  " 
##                auc 
## "Mean   :0.5784  "

We will switch to a batch size of 128, given its slightly higher evaluation metrics overall.

Dropout Regularization

To minimize overfitting and help the model generalize better, we’re going to experiment with adding dropout layers. A dropout layer randomly sets a chosen fraction of its inputs (e.g., a 20% dropout rate) to zero at each update during training, which prevents units from co-adapting too strongly and forces the network to learn more robust representations. We are going to add a dropout layer after the input layer as well as after each hidden layer, and experiment with a few different values for the rate.

Test Model 25: 20% Dropout After Input and Hidden Layers

# Create model
tmodel25 <- keras_model_sequential()

# Create layers
tmodel25 %>%
  layer_dropout(rate = 0.2, input_shape = c(81)) %>% # Dropout layer after input layer
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>% # Dropout layer after first hidden layer
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>% # Dropout layer after second hidden layer
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel25 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training25 <- tmodel25 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training25) # The validation loss and accuracy lines are following the training ones slightly more closely now, which we want to see. The validation set accuracy is decreasing towards the end though, which we don't want

# Predict classes for test data
tmod25class <- tmodel25 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod25result <- confusionMatrix(as.factor(tmod25class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod25result # Test set accuracy is 0.6022 - improved! Recall is 0.5598 with Precision at 0.6126, which are both improved as well
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4108 2814
##          1 2263 3578
##                                           
##                Accuracy : 0.6022          
##                  95% CI : (0.5937, 0.6107)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2045          
##                                           
##  Mcnemar's Test P-Value : 1.173e-14       
##                                           
##               Precision : 0.6126          
##                  Recall : 0.5598          
##                      F1 : 0.5850          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2803          
##    Detection Prevalence : 0.4577          
##       Balanced Accuracy : 0.6023          
##                                           
##        'Positive' Class : 1               
## 
tmod25roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod25class))
auc(tmod25roc) # AUC = 0.6023 - also improved
## Area under the curve: 0.6023

Let’s try with a higher dropout rate, at 50% for each layer.

Test Model 26: 50% Dropout After Input and Hidden Layers

# Create model
tmodel26 <- keras_model_sequential()

# Create layers
tmodel26 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel26 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training26 <- tmodel26 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training26) # The validation lines for loss and accuracy follow the training lines even more closely in trend and shape.

# Predict classes for test data
tmod26class <- tmodel26 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod26result <- confusionMatrix(as.factor(tmod26class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod26result # Test set accuracy is 0.6194. Recall at 0.6187, Precision at 0.6204, F1 at 0.6196. All of these are the best yet.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3951 2437
##          1 2420 3955
##                                          
##                Accuracy : 0.6194         
##                  95% CI : (0.611, 0.6279)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.2389         
##                                          
##  Mcnemar's Test P-Value : 0.8184         
##                                          
##               Precision : 0.6204         
##                  Recall : 0.6187         
##                      F1 : 0.6196         
##              Prevalence : 0.5008         
##          Detection Rate : 0.3099         
##    Detection Prevalence : 0.4995         
##       Balanced Accuracy : 0.6194         
##                                          
##        'Positive' Class : 1              
## 
tmod26roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod26class))
auc(tmod26roc) # AUC = 0.6194 - also the best yet
## Area under the curve: 0.6194

Let’s try with a different dropout rate for after the input layer than after the hidden layers.

Test Model 27: 50% Dropout After Input Layer with 20% Dropout After Each Hidden Layer

# Create model
tmodel27 <- keras_model_sequential()

# Create layers
tmodel27 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel27 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training27 <- tmodel27 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training27) # The validation lines for loss and accuracy follow the training lines a little less closely than the previous iteration

# Predict classes for test data
tmod27class <- tmodel27 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod27result <- confusionMatrix(as.factor(tmod27class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod27result # Test set accuracy is 0.603. Recall at 0.5440 - decreased from last test. Precision is still high at 0.6177
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4219 2915
##          1 2152 3477
##                                           
##                Accuracy : 0.603           
##                  95% CI : (0.5944, 0.6115)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2061          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6177          
##                  Recall : 0.5440          
##                      F1 : 0.5785          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2724          
##    Detection Prevalence : 0.4410          
##       Balanced Accuracy : 0.6031          
##                                           
##        'Positive' Class : 1               
## 
tmod27roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod27class))
auc(tmod27roc) # AUC = 0.6031
## Area under the curve: 0.6031

Test Model 28: 80% Dropout After Input Layer with 50% Dropout After Each Hidden Layer

# Create model
tmodel28 <- keras_model_sequential()

# Create layers
tmodel28 %>%
  layer_dropout(rate = 0.8, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel28 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training28 <- tmodel28 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training28) # The validation line for loss actually stays consistently below the training loss line, although the validation accuracy line is a bit erratic

# Predict classes for test data
tmod28class <- tmodel28 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod28result <- confusionMatrix(as.factor(tmod28class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod28result # Test set accuracy is 0.6137. Recall is 0.6366 - the highest yet. Precision = 0.6095, F1 = 0.6227
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3764 2323
##          1 2607 4069
##                                           
##                Accuracy : 0.6137          
##                  95% CI : (0.6052, 0.6222)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2274          
##                                           
##  Mcnemar's Test P-Value : 5.565e-05       
##                                           
##               Precision : 0.6095          
##                  Recall : 0.6366          
##                      F1 : 0.6227          
##              Prevalence : 0.5008          
##          Detection Rate : 0.3188          
##    Detection Prevalence : 0.5231          
##       Balanced Accuracy : 0.6137          
##                                           
##        'Positive' Class : 1               
## 
tmod28roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod28class))
auc(tmod28roc) # AUC = 0.6137
## Area under the curve: 0.6137

Test Model 26, with 50% dropout after every layer, outperformed Model 24 (the model to beat) on every metric, achieving the highest accuracy (0.6194) and AUC (0.6194) so far. Test Model 28, with 80% dropout after the input layer and 50% dropout after each hidden layer, had the highest recall yet, at 0.6366. We will therefore run cross-validation on both of these test models to compare them reliably.

Cross Validation for Model 26: 50% Dropout After Input and Hidden Layers

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval26 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model26 <- keras_model_sequential()
  
  # Layers
  model26 %>%
    layer_dropout(rate = 0.5, input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model26 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model26 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod26class <- model26 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod26result <- confusionMatrix(as.factor(mod26class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval26$accuracy[f] <- mod26result$overall['Accuracy']
  eval26$precision[f] <- mod26result$byClass['Precision']
  eval26$recall[f] <- mod26result$byClass['Recall']
  eval26$f1[f] <- mod26result$byClass['F1']
  mod26roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod26class))
  eval26$auc[f] <- auc(mod26roc)
}

eval26
##     accuracy precision    recall        f1       auc
## 1  0.6163128 0.6135846 0.6317272 0.6225237 0.6162874
## 2  0.5863682 0.5988671 0.5268650 0.5605634 0.5864565
## 3  0.6138993 0.6198172 0.5926331 0.6059204 0.6139359
## 4  0.5980607 0.6027264 0.5797690 0.5910248 0.5980951
## 5  0.5888055 0.5903576 0.5848291 0.5875804 0.5888124
## 6  0.5871221 0.6033623 0.5121531 0.5540288 0.5872341
## 7  0.5907985 0.5972639 0.5613045 0.5787262 0.5908425
## 8  0.6205326 0.6309807 0.5833725 0.6062434 0.6205881
## 9  0.6029168 0.5947156 0.6487636 0.6205650 0.6028698
## 10 0.6049741 0.6311491 0.5084586 0.5631994 0.6051410
summary(eval26) # The mean of each evaluation metric is: Accuracy: 0.6010, Precision: 0.6083, Recall: 0.5730, f1: 0.5890, AUC: 0.6010
##     accuracy        precision          recall             f1        
##  Min.   :0.5864   Min.   :0.5904   Min.   :0.5085   Min.   :0.5540  
##  1st Qu.:0.5893   1st Qu.:0.5977   1st Qu.:0.5355   1st Qu.:0.5671  
##  Median :0.6005   Median :0.6030   Median :0.5816   Median :0.5893  
##  Mean   :0.6010   Mean   :0.6083   Mean   :0.5730   Mean   :0.5890  
##  3rd Qu.:0.6117   3rd Qu.:0.6183   3rd Qu.:0.5907   3rd Qu.:0.6062  
##  Max.   :0.6205   Max.   :0.6311   Max.   :0.6488   Max.   :0.6225  
##       auc        
##  Min.   :0.5865  
##  1st Qu.:0.5893  
##  Median :0.6005  
##  Mean   :0.6010  
##  3rd Qu.:0.6117  
##  Max.   :0.6206

Cross Validation for Model 28: 80% Dropout After Input Layer with 50% Dropout After Each Hidden Layer

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval28 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model28 <- keras_model_sequential()
  
  # Layers
  model28 %>%
    layer_dropout(rate = 0.8, input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model28 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model28 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod28class <- model28 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod28result <- confusionMatrix(as.factor(mod28class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval28$accuracy[f] <- mod28result$overall['Accuracy']
  eval28$precision[f] <- mod28result$byClass['Precision']
  eval28$recall[f] <- mod28result$byClass['Recall']
  eval28$f1[f] <- mod28result$byClass['F1']
  mod28roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod28class))
  eval28$auc[f] <- auc(mod28roc)
}

eval28
##     accuracy precision    recall        f1       auc
## 1  0.5461882 0.5256016 0.9635482 0.6801767 0.5455004
## 2  0.5906574 0.6421640 0.4122411 0.5021341 0.5909222
## 3  0.6046748 0.6067362 0.5988762 0.6027806 0.6046848
## 4  0.5704567 0.6459866 0.3153294 0.4237910 0.5709364
## 5  0.5627736 0.6249233 0.3177774 0.4213140 0.5631957
## 6  0.5809973 0.6010483 0.4854947 0.5371270 0.5811399
## 7  0.5885216 0.6227067 0.4523362 0.5240214 0.5887251
## 8  0.6016809 0.6625935 0.4167843 0.5116996 0.6019573
## 9  0.5857312 0.6031297 0.5038589 0.5490432 0.5858152
## 10 0.5928919 0.6212213 0.4796366 0.5413241 0.5930877
summary(eval28) # The mean of each evaluation metric is: Accuracy: 0.5825, Precision: 0.6156, Recall: 0.4946, f1: 0.5293, AUC: 0.5826. Of note, the recall between the folds has a large range, ranging from 0.31 to 0.96
##     accuracy        precision          recall             f1        
##  Min.   :0.5462   Min.   :0.5256   Min.   :0.3153   Min.   :0.4213  
##  1st Qu.:0.5731   1st Qu.:0.6040   1st Qu.:0.4134   1st Qu.:0.5045  
##  Median :0.5871   Median :0.6220   Median :0.4660   Median :0.5306  
##  Mean   :0.5825   Mean   :0.6156   Mean   :0.4946   Mean   :0.5293  
##  3rd Qu.:0.5923   3rd Qu.:0.6379   3rd Qu.:0.4993   3rd Qu.:0.5471  
##  Max.   :0.6047   Max.   :0.6626   Max.   :0.9635   Max.   :0.6802  
##       auc        
##  Min.   :0.5455  
##  1st Qu.:0.5735  
##  Median :0.5873  
##  Mean   :0.5826  
##  3rd Qu.:0.5925  
##  Max.   :0.6047

Let’s compare the two new model configurations (Models 26 and 28) that had 10-fold cross validation run on them against the model to beat, Model 24 (no dropout layers).

# No dropout layers
summary(eval24)[4,]
##           accuracy          precision             recall                 f1 
## "Mean   :0.5783  " "Mean   :0.5871  " "Mean   :0.5349  " "Mean   :0.5592  " 
##                auc 
## "Mean   :0.5784  "
# All dropout layers 50%
summary(eval26)[4,] # Had the highest mean accuracy, recall, f1 and AUC
##           accuracy          precision             recall                 f1 
## "Mean   :0.6010  " "Mean   :0.6083  " "Mean   :0.5730  " "Mean   :0.5890  " 
##                auc 
## "Mean   :0.6010  "
# 80% dropout after input layer, 50% dropout after hidden layers
summary(eval28)[4,] # Had the highest mean precision
##           accuracy          precision             recall                 f1 
## "Mean   :0.5825  " "Mean   :0.6156  " "Mean   :0.4946  " "Mean   :0.5293  " 
##                auc 
## "Mean   :0.5826  "

We will be moving forward with Model 26 (50% dropout after every layer), as it has the highest mean in every evaluation metric except precision, where it trails the best performer by less than 0.01. Additionally, Model 28’s recall varied widely between folds (from about 0.31 to 0.96), indicating that it isn’t performing consistently.

Weight regularization

In order to further counter the overfitting, we’re going to experiment with adding some weight regularization as well. Weight regularization adds a penalty term to the cost function, which shrinks the weights during training. The idea is that smaller weights produce a simpler model, which should help address overfitting. There are two types of weight regularization, L1 and L2 (weight decay). L1 adds a term based on the absolute values of the weights, whereas L2 adds a term based on their squared values. As a result, L1 can drive weights exactly to zero while L2 only shrinks them toward zero, which is why L2 is generally preferred over L1.
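The two penalty terms are easy to write out explicitly. Here is a minimal sketch in Python/NumPy (illustrative only; the weight values are made up, though the rate lam = 0.001 matches the l = 0.001 used below):

```python
import numpy as np

# Hypothetical weights for a single dense layer (not from any model here)
w = np.array([-0.5, 0.0, 0.25, 2.0])
lam = 0.001  # regularization rate

l1_penalty = lam * np.sum(np.abs(w))  # L1: sum of absolute weights
l2_penalty = lam * np.sum(w ** 2)     # L2 (weight decay): sum of squared weights

print(l1_penalty)  # 0.00275
print(l2_penalty)  # 0.0043125
```

The gradient of the L1 term has constant magnitude (lam times the sign of the weight), so small weights get pushed all the way to zero; the L2 gradient shrinks in proportion to the weight, so weights approach zero but never reach it exactly.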

First we will try L1 regularization with a couple of different rates.

Test Model 29: L1 Regularization with Rate of 0.001

# Create model
tmodel29 <- keras_model_sequential()

# Create layers
tmodel29 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l1(l = 0.001)) %>% # Add the regularizer and rate to each hidden layer
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l1(l = 0.001)) %>% # Added here too
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel29 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training29 <- tmodel29 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training29) # The training set very quickly reaches its final values for loss and accuracy. The validation accuracy peaks quickly, with values that are lower than what we've already seen

# Predict classes for test data
tmod29class <- tmodel29 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod29result <- confusionMatrix(as.factor(tmod29class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod29result # Test set accuracy is 0.5845. Recall at 0.3429 (very low), Precision at 0.6653, F1 at 0.4526
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5268 4200
##          1 1103 2192
##                                           
##                Accuracy : 0.5845          
##                  95% CI : (0.5759, 0.5931)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1697          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6653          
##                  Recall : 0.3429          
##                      F1 : 0.4526          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1717          
##    Detection Prevalence : 0.2582          
##       Balanced Accuracy : 0.5849          
##                                           
##        'Positive' Class : 1               
## 
tmod29roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod29class))
auc(tmod29roc) # AUC = 0.5849
## Area under the curve: 0.5849
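One caveat on the AUC values reported throughout: roc() is being passed the hard 0/1 class predictions rather than predicted probabilities, so the ROC curve has only a single operating point and the AUC collapses to the balanced accuracy, which is why the AUC of 0.5849 matches the Balanced Accuracy in the confusion matrix above. A small Python check using the counts from that matrix:

```python
# Counts from the Model 29 confusion matrix above
tp, fn = 2192, 4200  # readmitted (positive class): caught / missed
tn, fp = 5268, 1103  # not readmitted: correct / false alarms

tpr = tp / (tp + fn)  # sensitivity (recall)
tnr = tn / (tn + fp)  # specificity

# With binary predictions the trapezoidal AUC reduces to (TPR + TNR) / 2,
# i.e. the balanced accuracy
print(round((tpr + tnr) / 2, 4))  # 0.5849
```

Passing the raw sigmoid outputs to roc() instead of the thresholded classes would yield an AUC that reflects the full ranking of predictions.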

Test Model 30: L1 Regularization with Rate of 0.01

# Create model
tmodel30 <- keras_model_sequential()

# Create layers
tmodel30 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l1(l = 0.01)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l1(l = 0.01)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel30 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training30 <- tmodel30 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training30) # Well this is a mess. The accuracy for each epoch is all over the place (training and validation), with the validation set loss never changing. This doesn't appear to be a viable parameter configuration

# Predict classes for test data
tmod30class <- tmodel30 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod30result <- confusionMatrix(as.factor(tmod30class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
## Warning in confusionMatrix.default(as.factor(tmod30class), as.factor(db[db[, :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
tmod30result # Test set accuracy is 0.5008. Recall is at 1.0, because the model just guessed 1 for every prediction. We will not be using this configuration
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    0    0
##          1 6371 6392
##                                           
##                Accuracy : 0.5008          
##                  95% CI : (0.4921, 0.5095)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.5035          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision : 0.5008          
##                  Recall : 1.0000          
##                      F1 : 0.6674          
##              Prevalence : 0.5008          
##          Detection Rate : 0.5008          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 
tmod30roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod30class))
auc(tmod30roc) # AUC = 0.5
## Area under the curve: 0.5

Let’s try L2 regularization (weight decay) with a couple of different rates.

Test Model 31: L2 Regularization with Rate of 0.001

# Create model
tmodel31 <- keras_model_sequential()

# Create layers
tmodel31 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel31 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training31 <- tmodel31 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training31) # The validation set accuracy appears to be decreasing by the end, which we do not want. Otherwise, the two loss lines are closely following one another

# Predict classes for test data
tmod31class <- tmodel31 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod31result <- confusionMatrix(as.factor(tmod31class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod31result # Test set accuracy is 0.5853. Recall is fairly low, 0.3777
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5056 3978
##          1 1315 2414
##                                           
##                Accuracy : 0.5853          
##                  95% CI : (0.5767, 0.5939)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1711          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6474          
##                  Recall : 0.3777          
##                      F1 : 0.4770          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1891          
##    Detection Prevalence : 0.2922          
##       Balanced Accuracy : 0.5856          
##                                           
##        'Positive' Class : 1               
## 
tmod31roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod31class))
auc(tmod31roc) # AUC = 0.5856
## Area under the curve: 0.5856

Test Model 32: L2 Regularization with Rate of 0.01

# Create model
tmodel32 <- keras_model_sequential()

# Create layers
tmodel32 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l2(l = 0.01)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu', kernel_regularizer = regularizer_l2(l = 0.01)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel32 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training32 <- tmodel32 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training32) # The validation accuracy points seem somewhat erratic, and may be decreasing by the end

# Predict classes for test data
tmod32class <- tmodel32 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod32result <- confusionMatrix(as.factor(tmod32class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod32result # Test set accuracy is 0.5901. Recall is fairly low, 0.4233
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4826 3686
##          1 1545 2706
##                                           
##                Accuracy : 0.5901          
##                  95% CI : (0.5816, 0.5987)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1807          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6366          
##                  Recall : 0.4233          
##                      F1 : 0.5085          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2120          
##    Detection Prevalence : 0.3331          
##       Balanced Accuracy : 0.5904          
##                                           
##        'Positive' Class : 1               
## 
tmod32roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod32class))
auc(tmod32roc) # AUC = 0.5904
## Area under the curve: 0.5904

Overall, these four test models performed quite poorly, so we will not be moving forward with cross validation for any of these variants. It’s likely that the dropout layers we previously added already provide sufficient regularization. We will continue to move forward with the model parameters that don’t include weight regularization.

Optimizer

The optimizer is the algorithm that updates the weights and biases of each neuron in order to locate a minimum of the cost function (usually through some form of gradient descent). We have been using the Adam optimizer, which is an improvement on the RMSProp optimizer. We will try a few others to see if they improve model performance: namely Nadam and Adamax. Stochastic gradient descent (SGD) is also an option, but it is better suited to shallow networks, so we will not be using it.
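The practical difference between these candidates is how the per-parameter step size is scaled: Adam divides by a decaying average of squared gradients, Nadam is Adam with Nesterov momentum, and Adamax replaces the squared-gradient average with an exponentially weighted infinity norm, making the step less sensitive to occasional large gradients. A simplified single-parameter sketch of the Adamax update (in Python for illustration; the constants follow the usual Keras-style defaults, and this is not Keras's exact implementation):

```python
def adamax_step(theta, grad, m, u, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-7):
    """One Adamax update for a single parameter (illustrative sketch)."""
    m = b1 * m + (1 - b1) * grad   # first-moment estimate, as in Adam
    u = max(b2 * u, abs(grad))     # infinity-norm second-moment estimate
    theta = theta - (lr / (1 - b1 ** t)) * m / (u + eps)
    return theta, m, u

# Minimize f(x) = x^2 from x = 3 (gradient is 2x), with a larger
# learning rate than the default so convergence is visible quickly
x, m, u = 3.0, 0.0, 0.0
for t in range(1, 501):
    x, m, u = adamax_step(x, 2 * x, m, u, t, lr=0.1)
print(abs(x) < 1.0)  # True: the iterate has moved close to the minimum at 0
```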

Test Model 33: Nadam

# Create model
tmodel33 <- keras_model_sequential()

# Create layers
tmodel33 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel33 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'nadam', # Updated
  metrics = 'accuracy'
 )

# Fit model
training33 <- tmodel33 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training33) # The two loss curves follow each other more closely. The validation set accuracy sits above that of the training set

# Predict classes for test data
tmod33class <- tmodel33 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod33result <- confusionMatrix(as.factor(tmod33class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod33result # Test set accuracy is 0.6034. Recall is 0.4959 - not the highest we've seen
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4531 3222
##          1 1840 3170
##                                           
##                Accuracy : 0.6034          
##                  95% CI : (0.5948, 0.6119)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.207           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6327          
##                  Recall : 0.4959          
##                      F1 : 0.5560          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2484          
##    Detection Prevalence : 0.3925          
##       Balanced Accuracy : 0.6036          
##                                           
##        'Positive' Class : 1               
## 
tmod33roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod33class))
auc(tmod33roc) # AUC = 0.6036
## Area under the curve: 0.6036

Test Model 34: Adamax

# Create model
tmodel34 <- keras_model_sequential()

# Create layers
tmodel34 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel34 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adamax',
  metrics = 'accuracy'
 )

# Fit model
training34 <- tmodel34 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training34) # The model appears to be learning more slowly; the loss and accuracy for both the training and validation sets don't appear to have leveled off yet

# Predict classes for test data
tmod34class <- tmodel34 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod34result <- confusionMatrix(as.factor(tmod34class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod34result # Test set accuracy is 0.5969. Recall is 0.4599 - not great
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4678 3452
##          1 1693 2940
##                                           
##                Accuracy : 0.5969          
##                  95% CI : (0.5883, 0.6054)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1941          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6346          
##                  Recall : 0.4599          
##                      F1 : 0.5333          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2304          
##    Detection Prevalence : 0.3630          
##       Balanced Accuracy : 0.5971          
##                                           
##        'Positive' Class : 1               
## 
tmod34roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod34class))
auc(tmod34roc) # AUC = 0.5971
## Area under the curve: 0.5971

Let’s try that last model again with more epochs, to see if it converges to a better plateau.

Test Model 35: Adamax with 50 Epochs

# Create model
tmodel35 <- keras_model_sequential()

# Create layers
tmodel35 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel35 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adamax',
  metrics = 'accuracy'
 )

# Fit model
training35 <- tmodel35 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 50,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training35) # Looks promising, but accuracy is still increasing and loss still decreasing for both sets. We need more epochs to see it stabilize

# Predict classes for test data
tmod35class <- tmodel35 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod35result <- confusionMatrix(as.factor(tmod35class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod35result # Test set accuracy is 0.6075. Recall is 0.6344
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3698 2337
##          1 2673 4055
##                                           
##                Accuracy : 0.6075          
##                  95% CI : (0.5989, 0.6159)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2148          
##                                           
##  Mcnemar's Test P-Value : 2.214e-06       
##                                           
##               Precision : 0.6027          
##                  Recall : 0.6344          
##                      F1 : 0.6181          
##              Prevalence : 0.5008          
##          Detection Rate : 0.3177          
##    Detection Prevalence : 0.5271          
##       Balanced Accuracy : 0.6074          
##                                           
##        'Positive' Class : 1               
## 
tmod35roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod35class))
auc(tmod35roc) # AUC = 0.6074
## Area under the curve: 0.6074

Let’s try it again with 200 epochs.

Test Model 36: Adamax with 200 Epochs

# Create model
tmodel36 <- keras_model_sequential()

# Create layers
tmodel36 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel36 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adamax',
  metrics = 'accuracy'
 )

# Fit model
training36 <- tmodel36 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 200,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training36) # I wouldn't add any more epochs. The values appear to have stabilized for the most part.

# Predict classes for test data
tmod36class <- tmodel36 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod36result <- confusionMatrix(as.factor(tmod36class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod36result # Test set accuracy is 0.61. Recall is 0.6263. Precision = 0.6073, f1 = 0.6167
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3783 2389
##          1 2588 4003
##                                           
##                Accuracy : 0.61            
##                  95% CI : (0.6015, 0.6185)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.22            
##                                           
##  Mcnemar's Test P-Value : 0.005007        
##                                           
##               Precision : 0.6073          
##                  Recall : 0.6263          
##                      F1 : 0.6167          
##              Prevalence : 0.5008          
##          Detection Rate : 0.3136          
##    Detection Prevalence : 0.5164          
##       Balanced Accuracy : 0.6100          
##                                           
##        'Positive' Class : 1               
## 
tmod36roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod36class))
auc(tmod36roc) # 0.61
## Area under the curve: 0.61
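Since "appears to have stabilized" is judged by eye from the plot, a simple numeric alternative is to compare the mean validation loss over the last two windows of the training history. A small Python sketch with made-up loss histories (the window size and tolerance are arbitrary choices):

```python
def has_plateaued(losses, window=10, tol=1e-3):
    """True when the mean loss of the newest window is within `tol`
    of the mean of the window just before it."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return prev - last < tol

# Hypothetical validation-loss histories
improving = [1.0 - 0.01 * i for i in range(40)]  # still dropping each epoch
flat = [0.66] * 20 + [0.65] * 20                 # stalled at 0.65

print(has_plateaued(improving))  # False
print(has_plateaued(flat))       # True
```

The same windowed comparison could be applied to the epoch-by-epoch metrics stored in the training history object before committing to a final epoch count.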

Let’s use cross validation on Model 36 (Adamax optimizer with 200 epochs) to better evaluate it and compare to other tests, as it performed the best out of the different new optimizers tried.

Cross Validation for Model 36: Adamax with 200 Epochs

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval36 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model36 <- keras_model_sequential()
  
  # Layers
  model36 %>%
    layer_dropout(rate = 0.5, input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model36 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adamax',
    metrics = 'accuracy'
  )

  # Train
  model36 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 200, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod36class <- model36 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod36result <- confusionMatrix(as.factor(mod36class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval36$accuracy[f] <- mod36result$overall['Accuracy']
  eval36$precision[f] <- mod36result$byClass['Precision']
  eval36$recall[f] <- mod36result$byClass['Recall']
  eval36$f1[f] <- mod36result$byClass['F1']
  mod36roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod36class))
  eval36$auc[f] <- auc(mod36roc)
}

eval36
##     accuracy precision    recall        f1       auc
## 1  0.6133354 0.6094008 0.6348561 0.6218681 0.6133000
## 2  0.5981440 0.6162876 0.5232830 0.5659901 0.5982550
## 3  0.6136648 0.6221444 0.5823318 0.6015801 0.6137188
## 4  0.6003284 0.6142177 0.5435529 0.5767288 0.6004352
## 5  0.5928705 0.6014211 0.5548619 0.5772041 0.5929360
## 6  0.5835885 0.5876592 0.5645288 0.5758618 0.5836170
## 7  0.5989636 0.6040301 0.5780809 0.5907707 0.5989948
## 8  0.6113424 0.6263055 0.5549804 0.5884897 0.6114266
## 9  0.5869137 0.5984378 0.5309498 0.5626773 0.5869711
## 10 0.6017574 0.6038095 0.5958647 0.5998108 0.6017676
summary(eval36) # The mean of each evaluation metric is: Accuracy: 0.6001, Precision: 0.6084, Recall: 0.5663, f1: 0.5861, AUC: 0.6001.
##     accuracy        precision          recall             f1        
##  Min.   :0.5836   Min.   :0.5877   Min.   :0.5233   Min.   :0.5627  
##  1st Qu.:0.5942   1st Qu.:0.6020   1st Qu.:0.5464   1st Qu.:0.5761  
##  Median :0.5996   Median :0.6067   Median :0.5598   Median :0.5828  
##  Mean   :0.6001   Mean   :0.6084   Mean   :0.5663   Mean   :0.5861  
##  3rd Qu.:0.6089   3rd Qu.:0.6158   3rd Qu.:0.5813   3rd Qu.:0.5976  
##  Max.   :0.6137   Max.   :0.6263   Max.   :0.6349   Max.   :0.6219  
##       auc        
##  Min.   :0.5836  
##  1st Qu.:0.5943  
##  Median :0.5997  
##  Mean   :0.6001  
##  3rd Qu.:0.6090  
##  Max.   :0.6137

Let’s compare the two model configurations (Models 26 and 36) that had 10-fold cross validation run on them.

# Adam optimizer, 20 epochs
summary(eval26)[4,] # Had the highest mean accuracy, recall, f1 and AUC
##           accuracy          precision             recall                 f1 
## "Mean   :0.6010  " "Mean   :0.6083  " "Mean   :0.5730  " "Mean   :0.5890  " 
##                auc 
## "Mean   :0.6010  "
# Adamax optimizer, 200 epochs
summary(eval36)[4,] # Had the highest mean precision
##           accuracy          precision             recall                 f1 
## "Mean   :0.6001  " "Mean   :0.6084  " "Mean   :0.5663  " "Mean   :0.5861  " 
##                auc 
## "Mean   :0.6001  "

Based on the specifications above, Model 26 (Adam optimizer) still performed slightly better overall, so we will move forward with that one.

Let’s take a moment to look at the individual weights and biases for Test Model 26, to see what the network did.

get_weights(tmodel26) # Most values have a magnitude less than 1, and none appear to have a magnitude greater than 3. No bias is more negative than -1, suggesting we likely haven't run into the dying ReLU problem (a neuron that stops activating entirely under ReLU), which is often flagged by biases with large negative values. It may still be beneficial to try another activation function to see how it performs.
## [[1]]
##                [,1]         [,2]         [,3]          [,4]          [,5]
##  [1,] -0.4744611382  0.091398567 -0.187214285 -0.2475107163 -0.2432888299
##  [2,] -0.1798749119  0.011789901 -0.003095849 -0.2266997248  0.0623118058
##  [3,]  0.0525413230 -0.097153105 -0.235216469 -0.4917439520 -0.2046884745
##  [4,] -0.4226634502  0.147816285 -0.399975002 -0.0092467768 -0.2901850641
##  [5,] -0.4678824246  0.052772947 -0.198252186 -0.1279293299 -0.1110378802
##  [6,] -2.0461866856  1.581022859 -0.214450076 -0.3069192469 -0.3721113503
##  [7,] -2.5687897205  2.797765732 -0.390226156 -0.1442270279 -0.2704053819
##  [8,] -0.1901292354 -0.097085126 -0.088863529  0.0339989811 -0.3409554958
##  [9,] -0.0977857038 -0.018017551 -0.008364690 -0.2467742413 -0.0247658510
## [10,]  0.1667502075 -0.134816244  0.138807073 -0.2183353454  0.2081900239
## [11,] -0.0108955679  0.018517017 -0.209073320 -0.1875336766 -0.0140417274
## [12,] -0.0931004137 -0.104698673 -0.070449166  0.0114893243 -0.0358868726
## [13,] -0.0177020989 -0.042292528 -0.008786266  0.0732486323  0.0279095359
## [14,] -0.1077123657 -0.018172529 -0.120100588 -0.0410162285 -0.1100746021
## [15,]  0.1237144992  0.058136985  0.005391622 -0.0700873509  0.3021941781
## [16,]  0.2619885504  0.057566501  0.109226935 -0.2960078716 -0.2457196414
## [17,]  0.1722671688 -0.341968656 -0.159220517  0.1478145719 -0.1228408664
## [18,]  0.0465227477 -0.028466316  0.034007926  0.1572000682  0.1603648663
## [19,] -0.0041601667 -0.031345185 -0.251043439 -0.2233726680 -0.0362931080
## [20,]  0.0007413736  0.011581764  0.269451588 -0.3454548717 -0.0007586729
## [21,]  0.0878919065  0.001604949  0.007157370  0.2286967933 -0.3618330956
## [22,]  0.1683121175 -0.307422876  0.151782915 -0.1552150548  0.3720378280
## [23,]  0.3131779432 -0.052340396  0.192512751 -0.0025313806  0.0855317339
## [24,] -0.7709640265  0.525748968 -0.171243086  0.0129226008 -0.2221648097
## [25,] -0.5717882514  0.233642340 -0.099970289 -0.1263518333 -0.3637835383
## [26,] -0.0323696844 -0.140859663  0.153374255 -0.0835736245 -0.0878601447
## (remaining rows and columns of the 81 x 16 first-hidden-layer weight matrix omitted for brevity)
## 
## [[2]] (16 first-hidden-layer biases omitted)
## 
## [[3]] (16 x 16 second-hidden-layer weight matrix omitted)
## 
## [[4]] (16 second-hidden-layer biases omitted)
## 
## [[5]] (16 x 1 output-layer weight matrix omitted)
## 
## [[6]]
## [1] 0.01247375
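For reference, the list printed above holds the network's trainable weights layer by layer: [[1]] is the 81 x 16 kernel of the first dense layer, [[2]] its 16 biases, [[3]] the 16 x 16 kernel of the second dense layer, [[4]] its 16 biases, [[5]] the 16 x 1 kernel of the sigmoid output layer, and [[6]] the output bias (dropout layers carry no weights). A quick arithmetic check that these dimensions account for every trainable parameter:

```r
# Trainable parameters of the 81 -> 16 -> 16 -> 1 architecture:
# each dense layer has (inputs x units) weights plus one bias per unit
params <- (81 * 16 + 16) +  # first hidden layer
          (16 * 16 + 16) +  # second hidden layer
          (16 * 1  + 1)     # sigmoid output layer
params  # 1601
```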

Activation Function

The activation function is applied at each neuron (in every layer except the input layer) to the weighted sum of its inputs plus a bias, transforming or condensing the value before it is passed on. We have been using the Rectified Linear Unit (ReLU) activation function. We will also try the Scaled Exponential Linear Unit (SELU) to see whether it improves performance. SELU avoids the dying-ReLU problem (even though that is likely not an issue here) and also sidesteps the vanishing-gradient problem associated with the sigmoid activation function.
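As a quick illustration of the difference (a minimal base-R sketch, not how keras computes it internally), the two activations can be written as plain functions, using the alpha and lambda constants from the SELU paper:

```r
# ReLU: negative inputs are clipped to exactly zero
relu <- function(x) pmax(0, x)

# SELU: positive inputs are scaled linearly; negative inputs saturate
# smoothly toward -lambda * alpha, so gradients never die at zero
selu <- function(x,
                 alpha  = 1.6732632423543772,
                 lambda = 1.0507009873554805) {
  lambda * ifelse(x > 0, x, alpha * (exp(x) - 1))
}

relu(c(-2, 0, 2))  # negative input becomes 0
selu(c(-2, 0, 2))  # negative input stays non-zero and negative
```

Note that the SELU paper pairs the activation with lecun_normal initialization and alpha dropout to obtain its self-normalizing guarantees; the models below keep the defaults and standard dropout.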

Test Model 37: SELU

# Create model
tmodel37 <- keras_model_sequential()

# Create layers
tmodel37 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'selu') %>% # Updated from ReLU to SELU for each dense hidden layer
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'selu') %>% # Updated from ReLU to SELU for each dense hidden layer
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel37 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training37 <- tmodel37 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 20,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training37) # The values appear to have stabilized by the 20th epoch

# Predict classes for test data
tmod37class <- tmodel37 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod37result <- confusionMatrix(as.factor(tmod37class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod37result # Test set accuracy is 0.6115. Recall lower at 0.5393, Precision at 0.6313, F1 at 0.5817.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4358 2945
##          1 2013 3447
##                                        
##                Accuracy : 0.6115       
##                  95% CI : (0.603, 0.62)
##     No Information Rate : 0.5008       
##     P-Value [Acc > NIR] : < 2.2e-16    
##                                        
##                   Kappa : 0.2233       
##                                        
##  Mcnemar's Test P-Value : < 2.2e-16    
##                                        
##               Precision : 0.6313       
##                  Recall : 0.5393       
##                      F1 : 0.5817       
##              Prevalence : 0.5008       
##          Detection Rate : 0.2701       
##    Detection Prevalence : 0.4278       
##       Balanced Accuracy : 0.6117       
##                                        
##        'Positive' Class : 1            
## 
tmod37roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod37class))
auc(tmod37roc) # 0.6117
## Area under the curve: 0.6117
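As a sanity check, the headline numbers above follow directly from the confusion matrix counts; a base-R recomputation of the caret output (with 1 as the positive class):

```r
# Counts copied from the confusion matrix above
tp <- 3447; fp <- 2013; fn <- 2945; tn <- 4358

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)  # a.k.a. sensitivity
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 4)
# accuracy 0.6115, precision 0.6313, recall 0.5393, f1 0.5817
```

Note also that because hard 0/1 class predictions (rather than predicted probabilities) were passed to roc(), the ROC curve has a single operating point and the AUC collapses to the balanced accuracy, which is why both read 0.6117 here.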

Let’s run cross-validation on Test Model 37 so that we can better compare it to the current best performer, Model 26 (ReLU).

Cross Validation for Model 37: SELU

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval37 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model37 <- keras_model_sequential()
  
  # Layers
  model37 %>%
    layer_dropout(rate = 0.5, input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'selu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 16, activation = 'selu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model37 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model37 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 20, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod37class <- model37 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod37result <- confusionMatrix(as.factor(mod37class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval37$accuracy[f] <- mod37result$overall['Accuracy']
  eval37$precision[f] <- mod37result$byClass['Precision']
  eval37$recall[f] <- mod37result$byClass['Recall']
  eval37$f1[f] <- mod37result$byClass['F1']
  mod37roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod37class))
  eval37$auc[f] <- auc(mod37roc)
}

eval37
##     accuracy precision    recall        f1       auc
## 1  0.6105148 0.6276271 0.5466208 0.5843298 0.6106201
## 2  0.5978320 0.6135465 0.5318486 0.5697839 0.5979299
## 3  0.6093652 0.6233812 0.5559544 0.5877403 0.6094572
## 4  0.6068971 0.6247061 0.5391820 0.5788018 0.6070244
## 5  0.5851313 0.6044833 0.4966443 0.5452832 0.5852838
## 6  0.5807617 0.5956858 0.5066646 0.5475807 0.5808724
## 7  0.5954306 0.6132372 0.5200690 0.5628234 0.5955432
## 8  0.6104784 0.6267454 0.5491765 0.5854026 0.6105700
## 9  0.5933780 0.6010178 0.5580406 0.5787324 0.5934143
## 10 0.6091323 0.6227241 0.5571742 0.5881283 0.6092221
summary(eval37) # The mean of each evaluation metric is: Accuracy: 0.5999, Precision: 0.6153, Recall: 0.5361, f1: 0.5729, AUC: 0.6000.
##     accuracy        precision          recall             f1        
##  Min.   :0.5808   Min.   :0.5957   Min.   :0.4966   Min.   :0.5453  
##  1st Qu.:0.5939   1st Qu.:0.6067   1st Qu.:0.5230   1st Qu.:0.5646  
##  Median :0.6024   Median :0.6181   Median :0.5429   Median :0.5788  
##  Mean   :0.5999   Mean   :0.6153   Mean   :0.5361   Mean   :0.5729  
##  3rd Qu.:0.6093   3rd Qu.:0.6244   3rd Qu.:0.5543   3rd Qu.:0.5851  
##  Max.   :0.6105   Max.   :0.6276   Max.   :0.5580   Max.   :0.5881  
##       auc        
##  Min.   :0.5809  
##  1st Qu.:0.5939  
##  Median :0.6025  
##  Mean   :0.6000  
##  3rd Qu.:0.6094  
##  Max.   :0.6106
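The per-metric means can also be read off in one line with colMeans(eval37) instead of scanning summary(). As a standalone illustration using the fold accuracies printed above:

```r
# Fold-by-fold accuracies copied from the eval37 table above
acc <- c(0.6105148, 0.5978320, 0.6093652, 0.6068971, 0.5851313,
         0.5807617, 0.5954306, 0.6104784, 0.5933780, 0.6091323)

round(mean(acc), 4)  # 0.5999, matching the Mean row of summary(eval37)
```

With the full dataframe, round(colMeans(eval37), 4) returns all five means at once.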

Let’s compare the two model configurations (Models 26 and 37) that were evaluated with 10-fold cross-validation.

# ReLU
summary(eval26)[4,] # Had the highest mean accuracy, recall, f1 and AUC
##           accuracy          precision             recall                 f1 
## "Mean   :0.6010  " "Mean   :0.6083  " "Mean   :0.5730  " "Mean   :0.5890  " 
##                auc 
## "Mean   :0.6010  "
# SELU
summary(eval37)[4,] # Had the highest mean precision
##           accuracy          precision             recall                 f1 
## "Mean   :0.5999  " "Mean   :0.6153  " "Mean   :0.5361  " "Mean   :0.5729  " 
##                auc 
## "Mean   :0.6000  "

Model 26 is the better performer here: every evaluation metric except precision is higher, with recall higher by a particularly wide margin.

Epochs for Final Model

An epoch is one complete pass of the entire training set through the model. Now that we have selected the best-performing configuration from the parameters tuned, we will train it for more epochs to see if performance improves further with additional training. Let’s try 50 epochs first (up from 20).
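For context on what an epoch costs: with batch_size = 128, each epoch performs one weight update per batch. A back-of-the-envelope sketch (n here is an illustrative training-set size, not the actual row count of db):

```r
n          <- 115000  # hypothetical number of training rows
batch_size <- 128

updates_per_epoch <- ceiling(n / batch_size)
updates_per_epoch       # 899 weight updates per epoch
updates_per_epoch * 50  # 44950 updates over 50 epochs
```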

Test Model 38: 50 Epochs

# Create model
tmodel38 <- keras_model_sequential()

# Create layers
tmodel38 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel38 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training38 <- tmodel38 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 50, # Updated
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training38) # The accuracy for both sets still appears as if it may be improving, with the loss still decreasing

# Predict classes for test data
tmod38class <- tmodel38 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod38result <- confusionMatrix(as.factor(tmod38class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod38result # Test set accuracy is 0.6162. Recall at 0.6291, Precision at 0.6140, F1 at 0.6214. These appear to be an improvement over the 20-epoch model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3843 2371
##          1 2528 4021
##                                           
##                Accuracy : 0.6162          
##                  95% CI : (0.6077, 0.6246)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.2323          
##                                           
##  Mcnemar's Test P-Value : 0.02583         
##                                           
##               Precision : 0.6140          
##                  Recall : 0.6291          
##                      F1 : 0.6214          
##              Prevalence : 0.5008          
##          Detection Rate : 0.3151          
##    Detection Prevalence : 0.5131          
##       Balanced Accuracy : 0.6161          
##                                           
##        'Positive' Class : 1               
## 
tmod38roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod38class))
auc(tmod38roc) # AUC = 0.6161
## Area under the curve: 0.6161

Let’s try 100 epochs to see if there is any further improvement.

Test Model 39: 100 Epochs

# Create model
tmodel39 <- keras_model_sequential()

# Create layers
tmodel39 %>%
  layer_dropout(rate = 0.5, input_shape = c(81)) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile
tmodel39 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
 )

# Fit model
training39 <- tmodel39 %>% fit(
  db[db[,83] != 1, 1:81], 
  db[db[,83] != 1, 82], 
  epochs = 100,
  validation_data = list(db[db[,83] == 1, 1:81], db[db[,83] == 1, 82]),
  batch_size = 128,
  verbose = 0
 )

# Plot training
plot(training39) # The validation set accuracy appears to peak around 50 epochs and then start trending down. The validation set loss also decreases to around 50 epochs before starting to gradually increase again. 50 epochs is likely a good amount for model training

# Predict classes for test data
tmod39class <- tmodel39 %>% predict_classes(db[db[,83] == 1, 1:81], batch_size = 128)

# Evaluation
tmod39result <- confusionMatrix(as.factor(tmod39class), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1")
tmod39result # Test set accuracy is 0.6067. Recall at 0.5192, Precision at 0.6303, F1 at 0.5694. Scores, particularly recall, have decreased from the last iteration
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4424 3073
##          1 1947 3319
##                                           
##                Accuracy : 0.6067          
##                  95% CI : (0.5981, 0.6152)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2136          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6303          
##                  Recall : 0.5192          
##                      F1 : 0.5694          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2600          
##    Detection Prevalence : 0.4126          
##       Balanced Accuracy : 0.6068          
##                                           
##        'Positive' Class : 1               
## 
tmod39roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(tmod39class))
auc(tmod39roc) # AUC = 0.6068
## Area under the curve: 0.6068

Let’s try using cross validation to better evaluate the model with 50 epochs (Test Model 38).

Cross Validation for Model 38: 50 Epochs

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval38 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Train model per fold
for (f in 1:10){

  # Initiate
  model38 <- keras_model_sequential()
  
  # Layers
  model38 %>%
    layer_dropout(rate = 0.5, input_shape = c(81)) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model38 %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model38 %>% fit(
    db[db[,83] != f, 1:81], 
    db[db[,83] != f, 82], 
    epochs = 50, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod38class <- model38 %>% predict_classes(db[db[,83] == f, 1:81], batch_size = 128)

  # Evaluation
  mod38result <- confusionMatrix(as.factor(mod38class), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval38$accuracy[f] <- mod38result$overall['Accuracy']
  eval38$precision[f] <- mod38result$byClass['Precision']
  eval38$recall[f] <- mod38result$byClass['Recall']
  eval38$f1[f] <- mod38result$byClass['F1']
  mod38roc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(mod38class))
  eval38$auc[f] <- auc(mod38roc)
}

eval38
##     accuracy precision    recall        f1       auc
## 1  0.6062838 0.6145851 0.5735294 0.5933479 0.6063378
## 2  0.6058645 0.6119207 0.5819966 0.5965837 0.6058999
## 3  0.6138993 0.6301880 0.5545497 0.5899543 0.6140016
## 4  0.6056459 0.6132624 0.5760225 0.5940594 0.6057016
## 5  0.5915416 0.5897358 0.6062120 0.5978604 0.5915163
## 6  0.5860228 0.5826230 0.6109456 0.5964483 0.5859855
## 7  0.5974719 0.6068683 0.5569144 0.5808192 0.5975325
## 8  0.6144058 0.6194589 0.5962353 0.6076253 0.6144329
## 9  0.5946393 0.6014456 0.5635533 0.5818832 0.5946712
## 10 0.6015220 0.6158353 0.5433897 0.5773488 0.6016226
summary(eval38) # The mean of each evaluation metric is: Accuracy: 0.6017, Precision: 0.6086, Recall: 0.5763, f1: 0.5916, AUC: 0.6018.
##     accuracy        precision          recall             f1        
##  Min.   :0.5860   Min.   :0.5826   Min.   :0.5434   Min.   :0.5773  
##  1st Qu.:0.5953   1st Qu.:0.6028   1st Qu.:0.5586   1st Qu.:0.5839  
##  Median :0.6036   Median :0.6126   Median :0.5748   Median :0.5937  
##  Mean   :0.6017   Mean   :0.6086   Mean   :0.5763   Mean   :0.5916  
##  3rd Qu.:0.6062   3rd Qu.:0.6155   3rd Qu.:0.5927   3rd Qu.:0.5965  
##  Max.   :0.6144   Max.   :0.6302   Max.   :0.6109   Max.   :0.6076  
##       auc        
##  Min.   :0.5860  
##  1st Qu.:0.5954  
##  Median :0.6037  
##  Mean   :0.6018  
##  3rd Qu.:0.6062  
##  Max.   :0.6144

Let’s compare the two model configurations (Models 26 and 38) that had 10-fold cross validation run on them.

# 20 Epochs
summary(eval26)[4,]
##           accuracy          precision             recall                 f1 
## "Mean   :0.6010  " "Mean   :0.6083  " "Mean   :0.5730  " "Mean   :0.5890  " 
##                auc 
## "Mean   :0.6010  "
# 50 Epochs
summary(eval38)[4,] # The higher of the two by a small amount in every category
##           accuracy          precision             recall                 f1 
## "Mean   :0.6017  " "Mean   :0.6086  " "Mean   :0.5763  " "Mean   :0.5916  " 
##                auc 
## "Mean   :0.6018  "
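As a side note, indexing `summary()` output returns formatted strings, as seen above; `colMeans()` returns the per-column means as plain numbers, which is handier for comparison. A minimal illustration on a toy metrics frame (illustrative values, not our results):

```r
# colMeans() gives numeric column means directly, unlike summary(),
# whose rows are formatted strings. Toy metrics frame for illustration:
eval_demo <- data.frame(accuracy = c(0.60, 0.62), recall = c(0.55, 0.57))
colMeans(eval_demo)
# accuracy   recall
#     0.61     0.56
```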

Model 38 had the higher value on every evaluation metric by a small amount. Therefore, the optimized neural network is Model 38: an input layer of 81 neurons, two hidden layers with 16 neurons each, and an output layer with one neuron using the sigmoid activation function. The hidden layers each use the ReLU activation function, and a dropout layer with a 50% rate follows the input layer and each hidden layer. The Adam optimizer is used, with a batch size of 128 and 50 epochs. No weight regularization is used.

Now that we have optimized our neural network algorithm, we will try out a few traditional classifiers for comparison: k-Nearest Neighbours (kNN), Random Forest, and Logistic Regression.

kNN Model Creation and Evaluation

The k-nearest neighbours algorithm (kNN) classifies cases based on their proximity to known cases, measured with a distance function. The main parameter we will tune here is the k value: the number of nearest neighbours whose classes are considered when classifying each case.
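As a minimal sketch of the idea on toy data (not our dataset): each test point is assigned the majority class among its k nearest training points.

```r
library(class)  # provides knn()

# Two well-separated training clusters with known labels
train <- matrix(c(0, 0,
                  0, 1,
                  5, 5,
                  6, 5), ncol = 2, byrow = TRUE)
cl <- factor(c("A", "A", "B", "B"))

# Test points near each cluster; k = 3 takes a majority vote
test <- matrix(c(0.5, 0.5,
                 5.5, 5.0), ncol = 2, byrow = TRUE)
knn(train = train, test = test, cl = cl, k = 3)
# [1] A B
```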

set.seed(31) # Make results reproducible

First we will test an example with k = 5.

Test k = 5

knn5 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 5) # Took about 7 min to run; not feasible to systematically go through every k from 1:100 (etc) to identify optimal k
confusionMatrix(as.factor(knn5), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy is 0.5337; recall is somewhat low at 0.4140
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4165 3746
##          1 2206 2646
##                                          
##                Accuracy : 0.5337         
##                  95% CI : (0.525, 0.5423)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : 6.234e-14      
##                                          
##                   Kappa : 0.0677         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##               Precision : 0.5453         
##                  Recall : 0.4140         
##                      F1 : 0.4707         
##              Prevalence : 0.5008         
##          Detection Rate : 0.2073         
##    Detection Prevalence : 0.3802         
##       Balanced Accuracy : 0.5338         
##                                          
##        'Positive' Class : 1              
## 
knn5roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn5))
auc(knn5roc) # AUC is 0.5338
## Area under the curve: 0.5338

We will test a few more k values individually, as it is not computationally feasible to run through a long sequence of k values.
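A sketch of this approach on a small built-in dataset (iris stands in for our data, which is far too large for an exhaustive sweep):

```r
library(class)
set.seed(31)

# Hold out a third of iris as a test set
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Evaluate a handful of candidate k values rather than an exhaustive sweep
for (k in c(5, 15, 25)) {
  pred <- knn(train = train, test = test, cl = iris$Species[idx], k = k)
  cat("k =", k, "accuracy =", round(mean(pred == iris$Species[-idx]), 3), "\n")
}
```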

Test k = 15

knn15 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 15)
confusionMatrix(as.factor(knn15), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5258, Recall better than last iteration at 0.5724
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3052 2733
##          1 3319 3659
##                                           
##                Accuracy : 0.5258          
##                  95% CI : (0.5171, 0.5345)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 8.513e-09       
##                                           
##                   Kappa : 0.0515          
##                                           
##  Mcnemar's Test P-Value : 5.486e-14       
##                                           
##               Precision : 0.5244          
##                  Recall : 0.5724          
##                      F1 : 0.5473          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2867          
##    Detection Prevalence : 0.5467          
##       Balanced Accuracy : 0.5257          
##                                           
##        'Positive' Class : 1               
## 
knn15roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn15))
auc(knn15roc) # AUC = 0.5257
## Area under the curve: 0.5257

Test k = 25

knn25 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 25)
confusionMatrix(as.factor(knn25), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5359, Recall = 0.5219
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3504 3056
##          1 2867 3336
##                                           
##                Accuracy : 0.5359          
##                  95% CI : (0.5272, 0.5446)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 1.134e-15       
##                                           
##                   Kappa : 0.0719          
##                                           
##  Mcnemar's Test P-Value : 0.01457         
##                                           
##               Precision : 0.5378          
##                  Recall : 0.5219          
##                      F1 : 0.5297          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2614          
##    Detection Prevalence : 0.4860          
##       Balanced Accuracy : 0.5359          
##                                           
##        'Positive' Class : 1               
## 
knn25roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn25))
auc(knn25roc) # AUC = 0.5359
## Area under the curve: 0.5359

Test k = 35

knn35 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 35)
confusionMatrix(as.factor(knn35), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5319, Recall = 0.5369
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3357 2960
##          1 3014 3432
##                                           
##                Accuracy : 0.5319          
##                  95% CI : (0.5232, 0.5406)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 1.095e-12       
##                                           
##                   Kappa : 0.0638          
##                                           
##  Mcnemar's Test P-Value : 0.4929          
##                                           
##               Precision : 0.5324          
##                  Recall : 0.5369          
##                      F1 : 0.5347          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2689          
##    Detection Prevalence : 0.5051          
##       Balanced Accuracy : 0.5319          
##                                           
##        'Positive' Class : 1               
## 
knn35roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn35))
auc(knn35roc) # AUC = 0.5319
## Area under the curve: 0.5319

Test k = 45

knn45 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 45)
confusionMatrix(as.factor(knn45), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5443, Recall = 0.5454
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3461 2906
##          1 2910 3486
##                                          
##                Accuracy : 0.5443         
##                  95% CI : (0.5356, 0.553)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.0886         
##                                          
##  Mcnemar's Test P-Value : 0.9686         
##                                          
##               Precision : 0.5450         
##                  Recall : 0.5454         
##                      F1 : 0.5452         
##              Prevalence : 0.5008         
##          Detection Rate : 0.2731         
##    Detection Prevalence : 0.5011         
##       Balanced Accuracy : 0.5443         
##                                          
##        'Positive' Class : 1              
## 
knn45roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn45))
auc(knn45roc) # AUC = 0.5443
## Area under the curve: 0.5443

Test k = 55

knn55 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 55)
confusionMatrix(as.factor(knn55), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5481, Recall = 0.5574
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3432 2829
##          1 2939 3563
##                                           
##                Accuracy : 0.5481          
##                  95% CI : (0.5394, 0.5567)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.0961          
##                                           
##  Mcnemar's Test P-Value : 0.1512          
##                                           
##               Precision : 0.5480          
##                  Recall : 0.5574          
##                      F1 : 0.5527          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2792          
##    Detection Prevalence : 0.5094          
##       Balanced Accuracy : 0.5481          
##                                           
##        'Positive' Class : 1               
## 
knn55roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn55))
auc(knn55roc) # AUC = 0.5481
## Area under the curve: 0.5481

Test k = 65

knn65 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 65)
confusionMatrix(as.factor(knn65), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5504, Recall = 0.5579
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3459 2826
##          1 2912 3566
##                                           
##                Accuracy : 0.5504          
##                  95% CI : (0.5417, 0.5591)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.1008          
##                                           
##  Mcnemar's Test P-Value : 0.2618          
##                                           
##               Precision : 0.5505          
##                  Recall : 0.5579          
##                      F1 : 0.5542          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2794          
##    Detection Prevalence : 0.5076          
##       Balanced Accuracy : 0.5504          
##                                           
##        'Positive' Class : 1               
## 
knn65roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn65))
auc(knn65roc) # AUC = 0.5504
## Area under the curve: 0.5504

Let’s increase k in steps of 20 now, as the evaluation metrics are still gradually improving.

Test k = 85

knn85 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 85)
confusionMatrix(as.factor(knn85), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.567, Recall = 0.5853, which is the best yet
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3496 2651
##          1 2875 3741
##                                           
##                Accuracy : 0.567           
##                  95% CI : (0.5584, 0.5756)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.134           
##                                           
##  Mcnemar's Test P-Value : 0.002701        
##                                           
##               Precision : 0.5654          
##                  Recall : 0.5853          
##                      F1 : 0.5752          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2931          
##    Detection Prevalence : 0.5184          
##       Balanced Accuracy : 0.5670          
##                                           
##        'Positive' Class : 1               
## 
knn85roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn85))
auc(knn85roc) # AUC = 0.567
## Area under the curve: 0.567

Test k = 105

knn105 <- knn(train = db[db[,83] != 1, 1:81], test = db[db[,83] == 1, 1:81], cl = db[db[,83] != 1, 82], k = 105)
confusionMatrix(as.factor(knn105), as.factor(db[db[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5629, Recall = 0.5670, which is a decrease from the last iteration
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3560 2768
##          1 2811 3624
##                                           
##                Accuracy : 0.5629          
##                  95% CI : (0.5542, 0.5715)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.1257          
##                                           
##  Mcnemar's Test P-Value : 0.5739          
##                                           
##               Precision : 0.5632          
##                  Recall : 0.5670          
##                      F1 : 0.5651          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2839          
##    Detection Prevalence : 0.5042          
##       Balanced Accuracy : 0.5629          
##                                           
##        'Positive' Class : 1               
## 
knn105roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(knn105))
auc(knn105roc) # AUC = 0.5629
## Area under the curve: 0.5629

Since the evaluation metric values started to decrease from k = 85 to k = 105, we will move forward with k = 85, which resulted in the highest accuracy, recall, and AUC so far. We will use cross validation on this model configuration so that we may compare it to the other model types.

kNN Final Model: k = 85

set.seed(31)

# Create empty dataframe to house evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval.knn <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))


# Run algorithm for each group of folds; store evaluation metrics for each fold in a dataframe (eval.knn)
for (f in 1:10){
  knnclass <- knn(train = db[db[,83] != f, 1:81], test = db[db[,83] == f, 1:81], cl = db[db[,83] != f, 82], k = 85)
  knnresult <- confusionMatrix(as.factor(knnclass), as.factor(db[db[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval.knn$accuracy[f] <- knnresult$overall['Accuracy']
  eval.knn$precision[f] <- knnresult$byClass['Precision']
  eval.knn$recall[f] <- knnresult$byClass['Recall']
  eval.knn$f1[f] <- knnresult$byClass['F1']
  knnroc <- roc(as.numeric(db[db[,83] == f, 82]), as.numeric(knnclass))
  eval.knn$auc[f] <- auc(knnroc)
} # Took about 60 min to run

#Evaluate
eval.knn
##     accuracy precision    recall        f1       auc
## 1  0.5666379 0.5650400 0.5851064 0.5748982 0.5666075
## 2  0.5297512 0.5322235 0.5028812 0.5171365 0.5297911
## 3  0.5465916 0.5491657 0.5291088 0.5389507 0.5466217
## 4  0.5534094 0.5565133 0.5341867 0.5451215 0.5534456
## 5  0.5352564 0.5375610 0.5159981 0.5265589 0.5352896
## 6  0.5327051 0.5345779 0.5163870 0.5253250 0.5327295
## 7  0.5675591 0.5682674 0.5677328 0.5680000 0.5675588
## 8  0.5572225 0.5581102 0.5559216 0.5570138 0.5572245
## 9  0.5433189 0.5458141 0.5216570 0.5334622 0.5433411
## 10 0.5418955 0.5444318 0.5230263 0.5335144 0.5419281
summary(eval.knn) # Mean Accuracy = 0.5474, mean Precision = 0.5492, mean Recall = 0.5352, mean f1 = 0.5420, mean AUC = 0.5475
##     accuracy        precision          recall             f1        
##  Min.   :0.5298   Min.   :0.5322   Min.   :0.5029   Min.   :0.5171  
##  1st Qu.:0.5369   1st Qu.:0.5393   1st Qu.:0.5177   1st Qu.:0.5283  
##  Median :0.5450   Median :0.5475   Median :0.5261   Median :0.5362  
##  Mean   :0.5474   Mean   :0.5492   Mean   :0.5352   Mean   :0.5420  
##  3rd Qu.:0.5563   3rd Qu.:0.5577   3rd Qu.:0.5505   3rd Qu.:0.5540  
##  Max.   :0.5676   Max.   :0.5683   Max.   :0.5851   Max.   :0.5749  
##       auc        
##  Min.   :0.5298  
##  1st Qu.:0.5369  
##  Median :0.5450  
##  Mean   :0.5475  
##  3rd Qu.:0.5563  
##  Max.   :0.5676

Random Forest Model Creation and Evaluation

The Random Forest algorithm builds a collection of decision trees and aggregates their votes to classify new cases. Parameters to tune include the number of trees (ntree), the number of variables randomly sampled as split candidates at each node (mtry), the maximum number of terminal nodes per tree (maxnodes), and the minimum number of data points a terminal node may contain (nodesize).
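A minimal sketch of these parameters on a built-in dataset (iris; the parameter values here are purely illustrative, not tuned):

```r
library(randomForest)
set.seed(26)

# ntree: number of trees; mtry: variables sampled as candidates at each
# split; maxnodes: cap on terminal nodes per tree; nodesize: minimum
# number of cases a terminal node may contain
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 100, mtry = 2,
                   maxnodes = 8, nodesize = 5)
rf$ntree  # 100
rf$mtry   # 2
```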

First we need to alter the data frame so that the target attribute (readmitted) is a factor instead of a number. We will also set a new seed here.

db.rf <- db.over # Copy to a new data frame (the version before it was turned into a matrix)
db.rf$readmitted <- as.factor(db.rf$readmitted) # Convert the readmitted variable to a factor
set.seed(26)

First we will test a baseline model with 100 trees and 9 split candidates per node. Note that the OOB (out-of-bag) error estimate is the proportion of training cases misclassified when each case is predicted using only the trees that did not include it in their bootstrap sample.

100 Trees, 9 Variables per Split

rforest1 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=9, importance=F) # Took about 6 min
rforest1 # OOB (out of bag) estimate of  error rate: 0.22%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 9, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 0.22%
## Confusion matrix:
##       0     1  class.error
## 0 57077   248 4.326210e-03
## 1     5 57499 8.695047e-05
rf1predict <- predict(rforest1, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf1predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5047, Recall = 0.01392. It predicted nearly all cases as 0; the near-perfect OOB/training confusion matrix alongside this poor test performance points to overfitting
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6353 6303
##          1   18   89
##                                          
##                Accuracy : 0.5047         
##                  95% CI : (0.496, 0.5135)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : 0.1904         
##                                          
##                   Kappa : 0.0111         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##               Precision : 0.831776       
##                  Recall : 0.013924       
##                      F1 : 0.027389       
##              Prevalence : 0.500823       
##          Detection Rate : 0.006973       
##    Detection Prevalence : 0.008384       
##       Balanced Accuracy : 0.505549       
##                                          
##        'Positive' Class : 1              
## 
rf1roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf1predict))
auc(rf1roc) # AUC = 0.5055 (basically the same as randomly guessing the class)
## Area under the curve: 0.5055
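As a sanity check, the reported OOB rate of 0.22% follows directly from the OOB confusion matrix printed above:

```r
# OOB error = misclassified OOB predictions / total training cases,
# using the counts from the rforest1 confusion matrix above
misclassified <- 248 + 5
total <- 57077 + 248 + 5 + 57499
round(100 * misclassified / total, 2)  # 0.22
```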

Let’s try the same thing with 300 trees instead of 100.

300 Trees, 9 Variables per Split

rforest2 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=300, mtry=9, importance=F) # Took 9 min (it's expected it would take longer than the last attempt)
rforest2 # OOB estimate of error rate: 0.11%; even lower
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 300, mtry = 9, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 0.11%
## Confusion matrix:
##       0     1  class.error
## 0 57198   127 2.215438e-03
## 1     3 57501 5.217028e-05
rf2predict <- predict(rforest2, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf2predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5036 - worse, though not by much. Recall = 0.01126, which is worse as well
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6356 6320
##          1   15   72
##                                           
##                Accuracy : 0.5036          
##                  95% CI : (0.4949, 0.5124)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.2649          
##                                           
##                   Kappa : 0.0089          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision : 0.827586        
##                  Recall : 0.011264        
##                      F1 : 0.022226        
##              Prevalence : 0.500823        
##          Detection Rate : 0.005641        
##    Detection Prevalence : 0.006817        
##       Balanced Accuracy : 0.504455        
##                                           
##        'Positive' Class : 1               
## 
rf2roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf2predict))
auc(rf2roc) # AUC = 0.5045
## Area under the curve: 0.5045

Clearly we need to alter some other parameters to address the overfitting. Let’s try decreasing the number of features randomly selected at each split (mtry). We’ll stay with 100 trees for now, since it’s quicker and increasing the tree count had little effect in the last test.

100 Trees, 6 Variables per Split

rforest3 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, importance=F)
rforest3 # OOB estimate of error rate: 0.71%, which is slightly higher. Let's see if it predicts any better
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.71%
## Confusion matrix:
##       0     1 class.error
## 0 56604   721 0.012577410
## 1    93 57411 0.001617279
rf3predict <- predict(rforest3, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf3predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5098 - barely any better than the original. Recall = 0.02785
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6328 6214
##          1   43  178
##                                          
##                Accuracy : 0.5098         
##                  95% CI : (0.501, 0.5185)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : 0.02225        
##                                          
##                   Kappa : 0.0211         
##                                          
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##               Precision : 0.80543        
##                  Recall : 0.02785        
##                      F1 : 0.05383        
##              Prevalence : 0.50082        
##          Detection Rate : 0.01395        
##    Detection Prevalence : 0.01732        
##       Balanced Accuracy : 0.51055        
##                                          
##        'Positive' Class : 1              
## 
rf3roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf3predict))
auc(rf3roc) # AUC = 0.5105
## Area under the curve: 0.5105

Altering the number of variables per split didn’t do much. Let’s instead try setting the node size to a larger number (the default for classification is 1) to make the trees smaller and less prone to overfitting.
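The effect of nodesize on tree size can be seen directly with treesize(), which reports the number of terminal nodes in each tree (toy illustration on iris, not our data):

```r
library(randomForest)
set.seed(26)

# Same data, two nodesize settings; a larger nodesize forces splitting
# to stop earlier, so the trees end up with fewer terminal nodes
rf_deep    <- randomForest(Species ~ ., data = iris, ntree = 50, nodesize = 1)
rf_shallow <- randomForest(Species ~ ., data = iris, ntree = 50, nodesize = 25)
mean(treesize(rf_deep))     # more terminal nodes per tree
mean(treesize(rf_shallow))  # noticeably fewer
```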

100 Trees, 6 Variables per Split, Node Size of 25

rforest4 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 25, importance=F)
rforest4 # OOB estimate of error rate: 4.31%; going up
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 25, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 4.31%
## Confusion matrix:
##       0     1 class.error
## 0 54173  3152  0.05498474
## 1  1800 55704  0.03130217
rf4predict <- predict(rforest4, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf4predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5499 - increased. Recall = 0.1408, also increased
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6118 5492
##          1  253  900
##                                           
##                Accuracy : 0.5499          
##                  95% CI : (0.5412, 0.5585)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.101           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.78057         
##                  Recall : 0.14080         
##                      F1 : 0.23857         
##              Prevalence : 0.50082         
##          Detection Rate : 0.07052         
##    Detection Prevalence : 0.09034         
##       Balanced Accuracy : 0.55054         
##                                           
##        'Positive' Class : 1               
## 
rf4roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf4predict))
auc(rf4roc) # AUC = 0.5505
## Area under the curve: 0.5505
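Note that pROC is being given the hard 0/1 predictions here, which collapses the ROC curve to a single operating point; that is why the AUC equals the balanced accuracy above (0.5505 vs. 0.55054). A hedged sketch of a ranking-based alternative using the forest’s class-1 vote share (the helper name is illustrative, not from the original analysis):

```r
library(randomForest)
library(pROC)

# Illustrative helper: score cases by the fraction of trees voting for
# class 1, so roc() sees a continuous ranking rather than hard labels.
prob_auc <- function(fit, newdata, truth) {
  p1 <- predict(fit, newdata, type = "prob")[, "1"]  # class-1 vote share
  as.numeric(auc(roc(as.numeric(truth), p1, quiet = TRUE)))
}
# e.g. prob_auc(rforest4, db.rf[db.rf[,83] == 1, 1:81], db.rf[db.rf[,83] == 1, 82])
```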

Let’s try a node size of 50 instead to reduce overfitting further.

100 Trees, 6 Variables per Split, Node Size of 50

rforest5 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 50, importance=F)
rforest5 # OOB estimate of error rate: 9.6%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 50, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 9.6%
## Confusion matrix:
##       0     1 class.error
## 0 50545  6780  0.11827300
## 1  4246 53258  0.07383834
rf5predict <- predict(rforest5, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf5predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.5694 - increased. Recall = 0.2380, also increased
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5746 4871
##          1  625 1521
##                                          
##                Accuracy : 0.5694         
##                  95% CI : (0.5607, 0.578)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.1397         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##               Precision : 0.7088         
##                  Recall : 0.2380         
##                      F1 : 0.3563         
##              Prevalence : 0.5008         
##          Detection Rate : 0.1192         
##    Detection Prevalence : 0.1681         
##       Balanced Accuracy : 0.5699         
##                                          
##        'Positive' Class : 1              
## 
rf5roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf5predict))
auc(rf5roc) # AUC = 0.5699
## Area under the curve: 0.5699

Since the dataset is large and test-set accuracy is increasing with larger node sizes, let’s try a node size of 100.

100 Trees, 6 Variables per Split, Node Size of 100

rforest6 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 100, importance=F)
rforest6 # OOB estimate of error rate: 17.2%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 100, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 17.2%
## Confusion matrix:
##       0     1 class.error
## 0 45774 11551   0.2015002
## 1  8202 49302   0.1426336
rf6predict <- predict(rforest6, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf6predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.6063 - increased, Recall = 0.3988, also increased
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5189 3843
##          1 1182 2549
##                                           
##                Accuracy : 0.6063          
##                  95% CI : (0.5977, 0.6148)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2131          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6832          
##                  Recall : 0.3988          
##                      F1 : 0.5036          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1997          
##    Detection Prevalence : 0.2923          
##       Balanced Accuracy : 0.6066          
##                                           
##        'Positive' Class : 1               
## 
rf6roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf6predict))
auc(rf6roc) # AUC = 0.6066 - increased
## Area under the curve: 0.6066

Since the test accuracy, recall, and AUC are increasing with increased node size, let’s try 200 now.

100 Trees, 6 Variables per Split, Node Size of 200

rforest7 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 200, importance=F)
rforest7 # OOB estimate of error rate: 24.49%. Still increasing, meaning training performance is falling; that's fine as long as the test set keeps improving
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 200, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 24.49%
## Confusion matrix:
##       0     1 class.error
## 0 41665 15660   0.2731792
## 1 12466 45038   0.2167849
rf7predict <- predict(rforest7, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf7predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.6129 - increased, Recall = 0.4912, also increased
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4682 3252
##          1 1689 3140
##                                           
##                Accuracy : 0.6129          
##                  95% CI : (0.6044, 0.6213)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.226           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6502          
##                  Recall : 0.4912          
##                      F1 : 0.5597          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2460          
##    Detection Prevalence : 0.3784          
##       Balanced Accuracy : 0.6131          
##                                           
##        'Positive' Class : 1               
## 
rf7roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf7predict))
auc(rf7roc) # AUC = 0.6131
## Area under the curve: 0.6131

Let’s try a node size of 500 now.

100 Trees, 6 Variables per Split, Node Size of 500

rforest8 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 500, importance=F)
rforest8 # OOB estimate of error rate: 32.84%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 500, importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 32.84%
## Confusion matrix:
##       0     1 class.error
## 0 38039 19286   0.3364326
## 1 18421 39083   0.3203429
rf8predict <- predict(rforest8, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf8predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.6118 - similar to last, but slightly lower. It seems to have hit a point of diminishing returns. Recall is 0.5601 though, which is a big improvement
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4228 2812
##          1 2143 3580
##                                           
##                Accuracy : 0.6118          
##                  95% CI : (0.6032, 0.6202)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2237          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6255          
##                  Recall : 0.5601          
##                      F1 : 0.5910          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2805          
##    Detection Prevalence : 0.4484          
##       Balanced Accuracy : 0.6119          
##                                           
##        'Positive' Class : 1               
## 
rf8roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf8predict))
auc(rf8roc) # AUC = 0.6119
## Area under the curve: 0.6119

Let’s keep the node size of 500 for now. Next, let’s try capping the maximum number of terminal nodes per tree (maxnodes) at 0.3 times the training sample size, which is about 34000. By default, maxnodes is the largest number of terminal nodes the node size allows.
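For reference, the 34000 figure can be reproduced from the training-fold size reported in the OOB confusion matrices above (a quick arithmetic check, not code from the original analysis):

```r
# The 9 training folds hold 57,325 class-0 and 57,504 class-1 rows
# (row sums of the OOB confusion matrices above).
n_train  <- 57325 + 57504             # 114,829 training rows
maxnodes <- round(0.3 * n_train, -3)  # 0.3 x sample size, to the nearest 1000
maxnodes                              # 34000
```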

100 Trees, 6 Variables per Split, Node Size of 500, Max Nodes of 34000

rforest9 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 500, maxnodes = 34000, importance=F)
rforest9 # OOB estimate of error rate: 32.44%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 500, maxnodes = 34000,      importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 32.44%
## Confusion matrix:
##       0     1 class.error
## 0 38335 18990   0.3312691
## 1 18259 39245   0.3175257
rf9predict <- predict(rforest9, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf9predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.6156 - increased, Recall = 0.5584, which is slightly decreased
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4288 2823
##          1 2083 3569
##                                           
##                Accuracy : 0.6156          
##                  95% CI : (0.6071, 0.6241)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2314          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6315          
##                  Recall : 0.5584          
##                      F1 : 0.5927          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2796          
##    Detection Prevalence : 0.4428          
##       Balanced Accuracy : 0.6157          
##                                           
##        'Positive' Class : 1               
## 
rf9roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf9predict))
auc(rf9roc) # AUC = 0.6157
## Area under the curve: 0.6157

Let’s try decreasing the maxnodes even further, to 17000.

100 Trees, 6 Variables per Split, Node Size of 500, Max Nodes of 17000

rforest10 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 500, maxnodes = 17000, importance=F)
rforest10 # OOB estimate of error rate: 32.64%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 6, nodesize = 500, maxnodes = 17000,      importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 32.64%
## Confusion matrix:
##       0     1 class.error
## 0 37828 19497   0.3401134
## 1 17979 39525   0.3126565
rf10predict <- predict(rforest10, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf10predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.6175 - increased slightly. Recall = 0.5743, increased. We'll keep the maxnodes at 17000 then
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4210 2721
##          1 2161 3671
##                                          
##                Accuracy : 0.6175         
##                  95% CI : (0.609, 0.6259)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2351         
##                                          
##  Mcnemar's Test P-Value : 1.24e-15       
##                                          
##               Precision : 0.6295         
##                  Recall : 0.5743         
##                      F1 : 0.6006         
##              Prevalence : 0.5008         
##          Detection Rate : 0.2876         
##    Detection Prevalence : 0.4569         
##       Balanced Accuracy : 0.6176         
##                                          
##        'Positive' Class : 1              
## 
rf10roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf10predict))
auc(rf10roc) # AUC = 0.6176 - increased slightly as well
## Area under the curve: 0.6176

We’ll keep maxnodes at 17000, as accuracy, recall, and AUC all increased slightly from the previous iteration. Let’s try one last change, to the number of variables tried at each split (mtry), to see if it makes a positive difference. We’ll decrease it from 6 to 4.
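Rather than probing mtry values one at a time, the randomForest package’s tuneRF() can search over mtry by OOB error, starting from a given value and stepping by stepFactor. A sketch under the same column-layout assumption as the rest of this section (the helper name is hypothetical):

```r
library(randomForest)

# Hypothetical helper: search mtry by OOB error with tuneRF(). Extra
# arguments (nodesize, maxnodes, ...) are passed through to randomForest().
tune_mtry <- function(df, holdout = 1, ...) {
  train <- df[df[, 83] != holdout, ]
  tuneRF(x = train[, 1:81], y = train[, 82],
         mtryStart = 6, stepFactor = 2, improve = 0.01,
         ntreeTry = 100, trace = FALSE, plot = FALSE, ...)
}
# e.g. tune_mtry(db.rf, nodesize = 500, maxnodes = 17000)
```

tuneRF() returns a matrix of the (mtry, OOB error) pairs it tried, so the comparison the manual search performs here comes out in one call.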

100 Trees, 4 Variables per Split, Node Size of 500, Max Nodes of 17000

rforest11 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=4, nodesize = 500, maxnodes = 17000, importance=F)
rforest11 # OOB estimate of error rate: 33.82%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 100, mtry = 4, nodesize = 500, maxnodes = 17000,      importance = F) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 33.82%
## Confusion matrix:
##       0     1 class.error
## 0 37463 19862   0.3464806
## 1 18971 38533   0.3299075
rf11predict <- predict(rforest11, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf11predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.6064, Recall = 0.5616. The change didn't appear to produce an improvement. We'll keep mtry at 6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4149 2802
##          1 2222 3590
##                                           
##                Accuracy : 0.6064          
##                  95% CI : (0.5978, 0.6149)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2128          
##                                           
##  Mcnemar's Test P-Value : 3.117e-16       
##                                           
##               Precision : 0.6177          
##                  Recall : 0.5616          
##                      F1 : 0.5883          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2813          
##    Detection Prevalence : 0.4554          
##       Balanced Accuracy : 0.6064          
##                                           
##        'Positive' Class : 1               
## 
rf11roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf11predict))
auc(rf11roc) # AUC = 0.6064
## Area under the curve: 0.6064

There was no apparent improvement from changing mtry to 4, so we’ll keep it at 6. With those parameters, let’s try 500 trees again to see whether it’s feasible and worthwhile to run 10-fold cross-validation at that size.

500 Trees, 6 Variables per Split, Node Size of 500, Max Nodes of 17000

rforest12 <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=500, mtry=6, nodesize = 500, maxnodes = 17000, importance=F) # Took about 15 min
rforest12 # OOB estimate of error rate: 32.36%
## 
## Call:
##  randomForest(formula = readmitted ~ ., data = db.rf[db.rf[, 83] !=      1, 1:82], ntree = 500, mtry = 6, nodesize = 500, maxnodes = 17000,      importance = F) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 32.36%
## Confusion matrix:
##       0     1 class.error
## 0 37982 19343   0.3374270
## 1 17817 39687   0.3098393
rf12predict <- predict(rforest12, db.rf[db.rf[,83] == 1, 1:81])
confusionMatrix(as.factor(rf12predict), as.factor(db.rf[db.rf[,83] == 1, 82]), mode="prec_recall", positive = "1") # Accuracy = 0.61, Recall = 0.5620 - very close to same metrics as 100 trees case. Will stick with 100 trees for 10-fold cross validation for feasibility
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4194 2800
##          1 2177 3592
##                                           
##                Accuracy : 0.61            
##                  95% CI : (0.6015, 0.6185)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2202          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6226          
##                  Recall : 0.5620          
##                      F1 : 0.5907          
##              Prevalence : 0.5008          
##          Detection Rate : 0.2814          
##    Detection Prevalence : 0.4520          
##       Balanced Accuracy : 0.6101          
##                                           
##        'Positive' Class : 1               
## 
rf12roc <- roc(as.numeric(db.rf[db.rf[,83] == 1, 82]), as.numeric(rf12predict))
auc(rf12roc) # AUC = 0.6101
## Area under the curve: 0.6101
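Incidentally, the 100- vs. 500-tree question can also be read off a single fit: every randomForest classification object stores a running OOB error curve in $err.rate (one row per tree, with an overall "OOB" column plus per-class columns). A sketch, with an illustrative helper name:

```r
library(randomForest)

# Illustrative helper: extract the running OOB error (one value per tree)
# from a fitted classification forest.
oob_curve <- function(fit) fit$err.rate[, "OOB"]
# e.g. plot(oob_curve(rforest12), type = "l")  # flattens once extra trees stop helping
```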

For the final random forest model, which we will evaluate with 10-fold cross-validation, we will stick with 6 variables tried per split (mtry), 100 trees (ntree), a minimum node size of 500, and a maximum of 17000 terminal nodes per tree (maxnodes).

Random Forest Final Model: 100 Trees, 6 Variables per Split, Node Size of 500, Max Nodes of 17000

set.seed(26) 

# Create a dataframe (one row per fold) to hold the evaluation measures
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval.rf <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Run algorithm for each group of folds; store evaluation metrics for each fold in a dataframe (eval.rf)
for (f in 1:10){
  rfclass <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != f, 1:82], ntree=100, mtry=6, nodesize = 500, maxnodes = 17000, importance=F)
  rfpredict <- predict(rfclass, db.rf[db.rf[,83] == f, 1:81])
  rfresult <- confusionMatrix(as.factor(rfpredict), as.factor(db.rf[db.rf[,83] == f, 82]), mode="prec_recall", positive = "1")
  eval.rf$accuracy[f] <- rfresult$overall['Accuracy']
  eval.rf$precision[f] <- rfresult$byClass['Precision']
  eval.rf$recall[f] <- rfresult$byClass['Recall']
  eval.rf$f1[f] <- rfresult$byClass['F1']
  rfroc <- roc(as.numeric(db.rf[db.rf[,83] == f, 82]), as.numeric(rfpredict))
  eval.rf$auc[f] <- auc(rfroc)
} # Took about 25 min to run

# Evaluate
eval.rf
##     accuracy precision    recall        f1       auc
## 1  0.6221108 0.6350491 0.5771277 0.6047045 0.6221849
## 2  0.5965843 0.6076976 0.5483570 0.5765043 0.5966558
## 3  0.6099124 0.6215892 0.5653192 0.5921203 0.6099893
## 4  0.6144823 0.6288858 0.5621293 0.5936367 0.6145808
## 5  0.6025641 0.6158291 0.5489309 0.5804588 0.6026565
## 6  0.5871221 0.6005752 0.5239141 0.5596315 0.5872165
## 7  0.6084635 0.6180214 0.5710254 0.5935947 0.6085195
## 8  0.6118137 0.6249782 0.5620392 0.5918401 0.6118881
## 9  0.5888845 0.6042663 0.5175618 0.5575634 0.5889577
## 10 0.6110937 0.6242815 0.5614035 0.5911753 0.6111796
summary(eval.rf) # Mean metrics are: Accuracy = 0.6053, Precision = 0.6181, Recall = 0.5538, f1 = 0.5841, AUC = 0.6054
##     accuracy        precision          recall             f1        
##  Min.   :0.5871   Min.   :0.6006   Min.   :0.5176   Min.   :0.5576  
##  1st Qu.:0.5981   1st Qu.:0.6097   1st Qu.:0.5485   1st Qu.:0.5775  
##  Median :0.6092   Median :0.6198   Median :0.5617   Median :0.5915  
##  Mean   :0.6053   Mean   :0.6181   Mean   :0.5538   Mean   :0.5841  
##  3rd Qu.:0.6116   3rd Qu.:0.6248   3rd Qu.:0.5645   3rd Qu.:0.5932  
##  Max.   :0.6221   Max.   :0.6350   Max.   :0.5771   Max.   :0.6047  
##       auc        
##  Min.   :0.5872  
##  1st Qu.:0.5982  
##  Median :0.6093  
##  Mean   :0.6054  
##  3rd Qu.:0.6117  
##  Max.   :0.6222

Logistic Regression Model Creation and Evaluation

Logistic regression is a generalized linear model that estimates the probability of class membership and classifies cases by thresholding that probability. The only modeling choice to vary here is which variables enter the model. No seed is needed, since fitting involves no random number generation.
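As a minimal illustration of the probability-to-class step (synthetic data, not the readmission set):

```r
# Logistic regression classifies by thresholding predicted probabilities.
set.seed(7)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(2 * x))    # true model: logit(p) = 2x
fit  <- glm(y ~ x, family = binomial)
p    <- predict(fit, type = "response")  # fitted probabilities in (0, 1)
pred <- ifelse(p > 0.5, 1, 0)            # default 0.5 cutoff
mean(pred == y)                          # training accuracy; beats the 0.5 chance level
```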

db.lr <- db.over # Create a data frame specific to the model so that the target variable (readmitted) is a factor
db.lr$readmitted <- as.factor(db.lr$readmitted)

First we will try running a logistic regression model using all variables.
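One thing to anticipate with a fully dummy-encoded design: when every category of a variable gets its own dummy, each dummy set sums to 1 for every case, so it is perfectly collinear with the intercept and R drops one coefficient per set as NA. A minimal synthetic illustration (categories borrowed from the A1C variable for flavor):

```r
# Synthetic example of the dummy-variable singularity: three dummies that
# always sum to 1 are aliased with the intercept.
set.seed(11)
g <- sample(c(">7", "None", "Norm"), 200, replace = TRUE)
d <- data.frame(d1 = as.numeric(g == ">7"),
                d2 = as.numeric(g == "None"),
                d3 = as.numeric(g == "Norm"),  # d1 + d2 + d3 == 1 for every row
                y  = rbinom(200, 1, 0.5))
fit <- glm(y ~ d1 + d2 + d3, data = d, family = binomial)
coef(fit)  # one dummy coefficient comes back NA (dropped as singular)
```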

Test Model 1: All Variables

lr1 <- glm(formula = readmitted ~ ., data = db.lr[db.lr[,83] != 1, 1:82], family = 'binomial')
summary(lr1) # Output indicates that 15 coefficients are not defined because of singularities. This is expected: the number of dummy variables created for each categorical variable equals its number of categories. E.g., A1C result had 3 categories (>7, None, Norm) and was turned into 3 dummy variables; if two of them are 0 for a case, the third must be 1, so each dummy set is perfectly collinear and one coefficient per set is dropped. AIC = 151615
## 
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = db.lr[db.lr[, 
##     83] != 1, 1:82])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5741  -1.0877   0.2974   1.1575   1.7425  
## 
## Coefficients: (15 not defined because of singularities)
##                                                 Estimate Std. Error z value
## (Intercept)                                    -2.522840   0.635405  -3.970
## time_in_hospital                                0.320820   0.033480   9.583
## num_lab_procedures                              0.323971   0.047275   6.853
## num_procedures                                 -0.032223   0.024946  -1.292
## num_medications                                 0.112984   0.081200   1.391
## number_outpatient                              -0.105057   0.248310  -0.423
## number_emergency                                5.408048   0.536009  10.089
## number_inpatient                                4.272569   0.118141  36.165
## number_diagnoses                                0.695052   0.055046  12.627
## age                                             0.543342   0.041088  13.224
## A1Cresult_.7                                   -0.025415   0.032136  -0.791
## A1Cresult_None                                  0.103549   0.028106   3.684
## A1Cresult_Norm                                        NA         NA      NA
## race_Caucasian                                  0.194172   0.051085   3.801
## race_AfricanAmerican                            0.229182   0.052670   4.351
## race_Asian                                      0.149760   0.090573   1.653
## race_Hispanic                                   0.138163   0.066337   2.083
## race_Other                                            NA         NA      NA
## gender                                          0.016045   0.012463   1.287
## admission_type_id_Emergency                     1.070277   0.417172   2.566
## admission_type_id_Urgent                        1.028566   0.417387   2.464
## admission_type_id_Elective                      0.949809   0.417618   2.274
## admission_type_id_Other                               NA         NA      NA
## discharge_disposition_id_DcHome                -0.612703   0.069166  -8.858
## discharge_disposition_id_DcOtherFacility        0.366834   0.072059   5.091
## discharge_disposition_id_DcSNF                 -0.045754   0.071049  -0.644
## discharge_disposition_id_DcHomeWithHomeService -0.443814   0.071150  -6.238
## discharge_disposition_id_Other                        NA         NA      NA
## admission_source_id_PhysRef                     0.586019   0.401438   1.460
## admission_source_id_TransferExtFacility         0.434062   0.401847   1.080
## admission_source_id_ER                          0.478692   0.401472   1.192
## admission_source_id_Other                             NA         NA      NA
## diag_1_390.459                                  0.134692   0.026334   5.115
## diag_1_Other                                    0.038371   0.026788   1.432
## diag_1_250.0.250.9                              0.140047   0.033230   4.214
## diag_1_460.519                                 -0.110781   0.031743  -3.490
## diag_1_520.579                                 -0.005098   0.031520  -0.162
## diag_1_710.739                                 -0.195342   0.035717  -5.469
## diag_1_780.799                                 -0.229359   0.033953  -6.755
## diag_1_800.999                                        NA         NA      NA
## diag_2_390.459                                  0.075858   0.025696   2.952
## diag_2_Other                                    0.069745   0.025662   2.718
## diag_2_240.279.not250.                         -0.002908   0.031867  -0.091
## diag_2_250.0.250.9                              0.276176   0.029502   9.361
## diag_2_460.519                                 -0.065248   0.030994  -2.105
## diag_2_580.629                                        NA         NA      NA
## diag_3_390.459                                 -0.023828   0.028410  -0.839
## diag_3_Other                                    0.032158   0.028450   1.130
## diag_3_240.279.not250.                         -0.119066   0.033376  -3.567
## diag_3_250.0.250.9                              0.115604   0.030559   3.783
## diag_3_460.519                                  0.068973   0.035443   1.946
## diag_3_580.629                                        NA         NA      NA
## metformin_Down                                  0.427515   0.097355   4.391
## metformin_No                                    0.276933   0.062047   4.463
## metformin_Steady                                0.223694   0.062057   3.605
## metformin_Up                                          NA         NA      NA
## glimepiride_Down                                0.017024   0.169165   0.101
## glimepiride_No                                 -0.268440   0.103411  -2.596
## glimepiride_Steady                             -0.352759   0.106143  -3.323
## glimepiride_Up                                        NA         NA      NA
## glipizide_Down                                  0.247926   0.102011   2.430
## glipizide_No                                   -0.051980   0.067398  -0.771
## glipizide_Steady                               -0.020455   0.068160  -0.300
## glipizide_Up                                          NA         NA      NA
## glyburide_Down                                 -0.239963   0.101676  -2.360
## glyburide_No                                   -0.099301   0.065529  -1.515
## glyburide_Steady                               -0.058381   0.066125  -0.883
## glyburide_Up                                          NA         NA      NA
## pioglitazone_Down                               0.085270   0.201007   0.424
## pioglitazone_No                                 0.070345   0.119828   0.587
## pioglitazone_Steady                            -0.030944   0.121270  -0.255
## pioglitazone_Up                                       NA         NA      NA
## rosiglitazone_Down                             -0.541213   0.250069  -2.164
## rosiglitazone_No                               -0.037694   0.136541  -0.276
## rosiglitazone_Steady                           -0.104746   0.138310  -0.757
## rosiglitazone_Up                                      NA         NA      NA
## insulin_Down                                    0.070874   0.026631   2.661
## insulin_No                                     -0.031473   0.032747  -0.961
## insulin_Steady                                 -0.081090   0.025924  -3.128
## insulin_Up                                            NA         NA      NA
## change                                         -0.075063   0.022775  -3.296
## diabetesMed                                     0.255802   0.022098  11.576
##                                                Pr(>|z|)    
## (Intercept)                                    7.17e-05 ***
## time_in_hospital                                < 2e-16 ***
## num_lab_procedures                             7.23e-12 ***
## num_procedures                                 0.196456    
## num_medications                                0.164094    
## number_outpatient                              0.672230    
## number_emergency                                < 2e-16 ***
## number_inpatient                                < 2e-16 ***
## number_diagnoses                                < 2e-16 ***
## age                                             < 2e-16 ***
## A1Cresult_.7                                   0.429025    
## A1Cresult_None                                 0.000229 ***
## A1Cresult_Norm                                       NA    
## race_Caucasian                                 0.000144 ***
## race_AfricanAmerican                           1.35e-05 ***
## race_Asian                                     0.098235 .  
## race_Hispanic                                  0.037275 *  
## race_Other                                           NA    
## gender                                         0.197939    
## admission_type_id_Emergency                    0.010301 *  
## admission_type_id_Urgent                       0.013728 *  
## admission_type_id_Elective                     0.022945 *  
## admission_type_id_Other                              NA    
## discharge_disposition_id_DcHome                 < 2e-16 ***
## discharge_disposition_id_DcOtherFacility       3.57e-07 ***
## discharge_disposition_id_DcSNF                 0.519591    
## discharge_disposition_id_DcHomeWithHomeService 4.44e-10 ***
## discharge_disposition_id_Other                       NA    
## admission_source_id_PhysRef                    0.144345    
## admission_source_id_TransferExtFacility        0.280068    
## admission_source_id_ER                         0.233127    
## admission_source_id_Other                            NA    
## diag_1_390.459                                 3.14e-07 ***
## diag_1_Other                                   0.152020    
## diag_1_250.0.250.9                             2.50e-05 ***
## diag_1_460.519                                 0.000483 ***
## diag_1_520.579                                 0.871503    
## diag_1_710.739                                 4.52e-08 ***
## diag_1_780.799                                 1.43e-11 ***
## diag_1_800.999                                       NA    
## diag_2_390.459                                 0.003156 ** 
## diag_2_Other                                   0.006570 ** 
## diag_2_240.279.not250.                         0.927292    
## diag_2_250.0.250.9                              < 2e-16 ***
## diag_2_460.519                                 0.035277 *  
## diag_2_580.629                                       NA    
## diag_3_390.459                                 0.401626    
## diag_3_Other                                   0.258339    
## diag_3_240.279.not250.                         0.000360 ***
## diag_3_250.0.250.9                             0.000155 ***
## diag_3_460.519                                 0.051654 .  
## diag_3_580.629                                       NA    
## metformin_Down                                 1.13e-05 ***
## metformin_No                                   8.07e-06 ***
## metformin_Steady                               0.000313 ***
## metformin_Up                                         NA    
## glimepiride_Down                               0.919840    
## glimepiride_No                                 0.009436 ** 
## glimepiride_Steady                             0.000889 ***
## glimepiride_Up                                       NA    
## glipizide_Down                                 0.015083 *  
## glipizide_No                                   0.440558    
## glipizide_Steady                               0.764094    
## glipizide_Up                                         NA    
## glyburide_Down                                 0.018271 *  
## glyburide_No                                   0.129679    
## glyburide_Steady                               0.377295    
## glyburide_Up                                         NA    
## pioglitazone_Down                              0.671411    
## pioglitazone_No                                0.557171    
## pioglitazone_Steady                            0.798592    
## pioglitazone_Up                                      NA    
## rosiglitazone_Down                             0.030445 *  
## rosiglitazone_No                               0.782497    
## rosiglitazone_Steady                           0.448854    
## rosiglitazone_Up                                     NA    
## insulin_Down                                   0.007784 ** 
## insulin_No                                     0.336506    
## insulin_Steady                                 0.001760 ** 
## insulin_Up                                           NA    
## change                                         0.000981 ***
## diabetesMed                                     < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 159187  on 114828  degrees of freedom
## Residual deviance: 151481  on 114762  degrees of freedom
## AIC: 151615
## 
## Number of Fisher Scoring iterations: 4

Let’s remove one dummy variable from each categorical variable to eliminate the multicollinearity (the dummy-variable trap) and clean up the output; the model already flagged one level of each as NA because it is perfectly collinear with the remaining levels.

Test Model 2: Remove Variables Causing Multicollinearity

db.lr2 <- db.lr[,c(1:11, 13:16, 18:21, 23:26, 28:30, 32:38, 40:44, 46:50, 52:54, 56:58, 60:62, 64:66, 68:70, 72:74, 76:78, 80:83)] # Create a new data frame without the final dummy variable for each categorical variable, then re-run the logistic regression to clean it up
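As an alternative to hand-counted column indices, the aliased dummy columns can be dropped by name. A minimal sketch, assuming the first model object is named lr1 (an assumed name; its summary is shown above) and that the dummy column names in db.lr match the coefficient names:

```r
# Drop the dummy columns whose coefficients came back NA (aliased levels),
# selecting by name rather than by hard-coded column index
na_terms <- names(coef(lr1))[is.na(coef(lr1))] # e.g. "A1Cresult_Norm", "race_Other", ...
db.lr2 <- db.lr[, !(names(db.lr) %in% na_terms)]
```

This stays correct if columns are later added or reordered, which the index-based selection would not.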

lr2 <- glm(formula = readmitted ~ ., data = db.lr2[db.lr2[,68] != 1, 1:67], family = 'binomial')
summary(lr2) # Same AIC (151615), but cleaner. A number of variables are not statistically significant; let's try the prediction anyway for now
## 
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = db.lr2[db.lr2[, 
##     68] != 1, 1:67])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5741  -1.0877   0.2974   1.1575   1.7425  
## 
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                    -2.522840   0.635405  -3.970
## time_in_hospital                                0.320820   0.033480   9.583
## num_lab_procedures                              0.323971   0.047275   6.853
## num_procedures                                 -0.032223   0.024946  -1.292
## num_medications                                 0.112984   0.081200   1.391
## number_outpatient                              -0.105057   0.248310  -0.423
## number_emergency                                5.408048   0.536009  10.089
## number_inpatient                                4.272569   0.118141  36.165
## number_diagnoses                                0.695052   0.055046  12.627
## age                                             0.543342   0.041088  13.224
## A1Cresult_.7                                   -0.025415   0.032136  -0.791
## A1Cresult_None                                  0.103549   0.028106   3.684
## race_Caucasian                                  0.194172   0.051085   3.801
## race_AfricanAmerican                            0.229182   0.052670   4.351
## race_Asian                                      0.149760   0.090573   1.653
## race_Hispanic                                   0.138163   0.066337   2.083
## gender                                          0.016045   0.012463   1.287
## admission_type_id_Emergency                     1.070277   0.417172   2.566
## admission_type_id_Urgent                        1.028566   0.417387   2.464
## admission_type_id_Elective                      0.949809   0.417618   2.274
## discharge_disposition_id_DcHome                -0.612703   0.069166  -8.858
## discharge_disposition_id_DcOtherFacility        0.366834   0.072059   5.091
## discharge_disposition_id_DcSNF                 -0.045754   0.071049  -0.644
## discharge_disposition_id_DcHomeWithHomeService -0.443814   0.071150  -6.238
## admission_source_id_PhysRef                     0.586019   0.401438   1.460
## admission_source_id_TransferExtFacility         0.434062   0.401847   1.080
## admission_source_id_ER                          0.478692   0.401472   1.192
## diag_1_390.459                                  0.134692   0.026334   5.115
## diag_1_Other                                    0.038371   0.026788   1.432
## diag_1_250.0.250.9                              0.140047   0.033230   4.214
## diag_1_460.519                                 -0.110781   0.031743  -3.490
## diag_1_520.579                                 -0.005098   0.031520  -0.162
## diag_1_710.739                                 -0.195342   0.035717  -5.469
## diag_1_780.799                                 -0.229359   0.033953  -6.755
## diag_2_390.459                                  0.075858   0.025696   2.952
## diag_2_Other                                    0.069745   0.025662   2.718
## diag_2_240.279.not250.                         -0.002908   0.031867  -0.091
## diag_2_250.0.250.9                              0.276176   0.029502   9.361
## diag_2_460.519                                 -0.065248   0.030994  -2.105
## diag_3_390.459                                 -0.023828   0.028410  -0.839
## diag_3_Other                                    0.032158   0.028450   1.130
## diag_3_240.279.not250.                         -0.119066   0.033376  -3.567
## diag_3_250.0.250.9                              0.115604   0.030559   3.783
## diag_3_460.519                                  0.068973   0.035443   1.946
## metformin_Down                                  0.427515   0.097355   4.391
## metformin_No                                    0.276933   0.062047   4.463
## metformin_Steady                                0.223694   0.062057   3.605
## glimepiride_Down                                0.017024   0.169165   0.101
## glimepiride_No                                 -0.268440   0.103411  -2.596
## glimepiride_Steady                             -0.352759   0.106143  -3.323
## glipizide_Down                                  0.247926   0.102011   2.430
## glipizide_No                                   -0.051980   0.067398  -0.771
## glipizide_Steady                               -0.020455   0.068160  -0.300
## glyburide_Down                                 -0.239963   0.101676  -2.360
## glyburide_No                                   -0.099301   0.065529  -1.515
## glyburide_Steady                               -0.058381   0.066125  -0.883
## pioglitazone_Down                               0.085270   0.201007   0.424
## pioglitazone_No                                 0.070345   0.119828   0.587
## pioglitazone_Steady                            -0.030944   0.121270  -0.255
## rosiglitazone_Down                             -0.541213   0.250069  -2.164
## rosiglitazone_No                               -0.037694   0.136541  -0.276
## rosiglitazone_Steady                           -0.104746   0.138310  -0.757
## insulin_Down                                    0.070874   0.026631   2.661
## insulin_No                                     -0.031473   0.032747  -0.961
## insulin_Steady                                 -0.081090   0.025924  -3.128
## change                                         -0.075063   0.022775  -3.296
## diabetesMed                                     0.255802   0.022098  11.576
##                                                Pr(>|z|)    
## (Intercept)                                    7.17e-05 ***
## time_in_hospital                                < 2e-16 ***
## num_lab_procedures                             7.23e-12 ***
## num_procedures                                 0.196456    
## num_medications                                0.164094    
## number_outpatient                              0.672230    
## number_emergency                                < 2e-16 ***
## number_inpatient                                < 2e-16 ***
## number_diagnoses                                < 2e-16 ***
## age                                             < 2e-16 ***
## A1Cresult_.7                                   0.429025    
## A1Cresult_None                                 0.000229 ***
## race_Caucasian                                 0.000144 ***
## race_AfricanAmerican                           1.35e-05 ***
## race_Asian                                     0.098235 .  
## race_Hispanic                                  0.037275 *  
## gender                                         0.197939    
## admission_type_id_Emergency                    0.010301 *  
## admission_type_id_Urgent                       0.013728 *  
## admission_type_id_Elective                     0.022945 *  
## discharge_disposition_id_DcHome                 < 2e-16 ***
## discharge_disposition_id_DcOtherFacility       3.57e-07 ***
## discharge_disposition_id_DcSNF                 0.519591    
## discharge_disposition_id_DcHomeWithHomeService 4.44e-10 ***
## admission_source_id_PhysRef                    0.144345    
## admission_source_id_TransferExtFacility        0.280068    
## admission_source_id_ER                         0.233127    
## diag_1_390.459                                 3.14e-07 ***
## diag_1_Other                                   0.152020    
## diag_1_250.0.250.9                             2.50e-05 ***
## diag_1_460.519                                 0.000483 ***
## diag_1_520.579                                 0.871503    
## diag_1_710.739                                 4.52e-08 ***
## diag_1_780.799                                 1.43e-11 ***
## diag_2_390.459                                 0.003156 ** 
## diag_2_Other                                   0.006570 ** 
## diag_2_240.279.not250.                         0.927292    
## diag_2_250.0.250.9                              < 2e-16 ***
## diag_2_460.519                                 0.035277 *  
## diag_3_390.459                                 0.401626    
## diag_3_Other                                   0.258339    
## diag_3_240.279.not250.                         0.000360 ***
## diag_3_250.0.250.9                             0.000155 ***
## diag_3_460.519                                 0.051654 .  
## metformin_Down                                 1.13e-05 ***
## metformin_No                                   8.07e-06 ***
## metformin_Steady                               0.000313 ***
## glimepiride_Down                               0.919840    
## glimepiride_No                                 0.009436 ** 
## glimepiride_Steady                             0.000889 ***
## glipizide_Down                                 0.015083 *  
## glipizide_No                                   0.440558    
## glipizide_Steady                               0.764094    
## glyburide_Down                                 0.018271 *  
## glyburide_No                                   0.129679    
## glyburide_Steady                               0.377295    
## pioglitazone_Down                              0.671411    
## pioglitazone_No                                0.557171    
## pioglitazone_Steady                            0.798592    
## rosiglitazone_Down                             0.030445 *  
## rosiglitazone_No                               0.782497    
## rosiglitazone_Steady                           0.448854    
## insulin_Down                                   0.007784 ** 
## insulin_No                                     0.336506    
## insulin_Steady                                 0.001760 ** 
## change                                         0.000981 ***
## diabetesMed                                     < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 159187  on 114828  degrees of freedom
## Residual deviance: 151481  on 114762  degrees of freedom
## AIC: 151615
## 
## Number of Fisher Scoring iterations: 4
lr2prob <- predict(lr2, db.lr2[db.lr2[,68] == 1, 1:66]) # Note: without type = 'response', predict.glm returns the linear predictor (log-odds), not probabilities
lr2predict <- ifelse(lr2prob>=0.5, 1, 0) # Anything at or above the 0.5 cutoff is classified as 1, otherwise 0; since lr2prob is on the log-odds scale, this corresponds to a probability cutoff of about 0.62
confusionMatrix(as.factor(lr2predict), db.lr2[db.lr2[,68] == 1, 67], mode="prec_recall", positive = "1") # Accuracy = 0.5778; Higher precision at 0.6984 than recall (0.2764), indicating that the model favoured false negatives over false positives
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5608 4625
##          1  763 1767
##                                           
##                Accuracy : 0.5778          
##                  95% CI : (0.5692, 0.5864)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1565          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6984          
##                  Recall : 0.2764          
##                      F1 : 0.3961          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1384          
##    Detection Prevalence : 0.1982          
##       Balanced Accuracy : 0.5783          
##                                           
##        'Positive' Class : 1               
## 
lr2roc <- roc(as.numeric(db.lr2[db.lr2[,68] == 1, 67]), as.numeric(lr2predict)) # Built from the hard 0/1 predictions, so the AUC equals the balanced accuracy; passing the predicted probabilities instead would trace the full ROC curve
auc(lr2roc) # AUC = 0.5783
## Area under the curve: 0.5783
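Because predict.glm returns the linear predictor by default, the 0.5 cutoff above is applied on the log-odds scale. A sketch of the same prediction on the probability scale, with an ROC curve built from probabilities rather than hard labels (results may differ slightly from those reported above):

```r
# Predict on the probability scale so 0.5 is a true probability threshold
lr2prob.resp <- predict(lr2, db.lr2[db.lr2[,68] == 1, 1:66], type = 'response')
lr2predict.resp <- ifelse(lr2prob.resp >= 0.5, 1, 0)
# Passing probabilities (not 0/1 labels) to roc() traces the full ROC curve
lr2roc.resp <- roc(as.numeric(db.lr2[db.lr2[,68] == 1, 67]), lr2prob.resp)
auc(lr2roc.resp)
```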

Let’s remove every variable with a p-value greater than 0.05 from the model and see the effect.

Test Model 3: Remove Variables that are Not Statistically Significant

db.lr3 <- db.lr2[,c(1:2, 6:9, 11:13, 15, 17:21, 23, 27, 29:30, 32:35, 37:38, 41:42, 44:46, 48:50, 53, 59, 62, 64:68)] # Remove variables that were not statistically significant in the last model
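The significant terms can also be selected by name from the coefficient table rather than by hand-counted column indices. A sketch, assuming the class and fold columns of db.lr2 are named readmitted and folds:

```r
# p-values are the 4th column of the coefficient table; keep terms at or below alpha
pvals <- summary(lr2)$coefficients[, 4]
sig_terms <- setdiff(names(pvals)[pvals <= 0.05], "(Intercept)")
db.lr3 <- db.lr2[, c(sig_terms, "readmitted", "folds")]
```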

lr3 <- glm(formula = readmitted ~ ., data = db.lr3[db.lr3[,41] != 1, 1:40], family = 'binomial')
summary(lr3) # AIC = 151673, which is actually slightly higher than the last model (which we don't want); also, race_Hispanic is no longer considered statistically significant
## 
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = db.lr3[db.lr3[, 
##     41] != 1, 1:40])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5431  -1.0877   0.2966   1.1589   1.7424  
## 
## Coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                    -2.22652    0.43307  -5.141
## time_in_hospital                                0.33088    0.03028  10.927
## num_lab_procedures                              0.33200    0.04614   7.195
## number_emergency                                5.30452    0.53073   9.995
## number_inpatient                                4.27107    0.11760  36.318
## number_diagnoses                                0.71576    0.05332  13.423
## age                                             0.52661    0.03997  13.175
## A1Cresult_None                                  0.11678    0.01697   6.880
## race_Caucasian                                  0.14764    0.04240   3.482
## race_AfricanAmerican                            0.18087    0.04432   4.081
## race_Hispanic                                   0.09594    0.05988   1.602
## admission_type_id_Emergency                     1.12525    0.41625   2.703
## admission_type_id_Urgent                        1.12616    0.41644   2.704
## admission_type_id_Elective                      1.08685    0.41647   2.610
## discharge_disposition_id_DcHome                -0.57055    0.01869 -30.532
## discharge_disposition_id_DcOtherFacility        0.40686    0.02590  15.710
## discharge_disposition_id_DcHomeWithHomeService -0.39732    0.02330 -17.053
## diag_1_390.459                                  0.10214    0.01565   6.529
## diag_1_250.0.250.9                              0.11983    0.02512   4.770
## diag_1_460.519                                 -0.12943    0.02334  -5.544
## diag_1_710.739                                 -0.20578    0.02904  -7.085
## diag_1_780.799                                 -0.26189    0.02637  -9.931
## diag_2_390.459                                  0.07303    0.01984   3.682
## diag_2_Other                                    0.07139    0.01954   3.653
## diag_2_250.0.250.9                              0.27174    0.02442  11.127
## diag_2_460.519                                 -0.05881    0.02631  -2.235
## diag_3_240.279.not250.                         -0.12955    0.02214  -5.852
## diag_3_250.0.250.9                              0.10747    0.01714   6.269
## metformin_Down                                  0.45666    0.09706   4.705
## metformin_No                                    0.28348    0.06139   4.618
## metformin_Steady                                0.23024    0.06192   3.719
## glimepiride_No                                 -0.25127    0.08202  -3.063
## glimepiride_Steady                             -0.35779    0.08630  -4.146
## glipizide_Down                                  0.27497    0.07883   3.488
## glyburide_Down                                 -0.16863    0.08075  -2.088
## rosiglitazone_Down                             -0.47452    0.21023  -2.257
## insulin_Down                                    0.08417    0.02292   3.673
## insulin_Steady                                 -0.06213    0.01525  -4.074
## change                                         -0.06763    0.01598  -4.233
## diabetesMed                                     0.26400    0.01887  13.989
##                                                Pr(>|z|)    
## (Intercept)                                    2.73e-07 ***
## time_in_hospital                                < 2e-16 ***
## num_lab_procedures                             6.22e-13 ***
## number_emergency                                < 2e-16 ***
## number_inpatient                                < 2e-16 ***
## number_diagnoses                                < 2e-16 ***
## age                                             < 2e-16 ***
## A1Cresult_None                                 5.99e-12 ***
## race_Caucasian                                 0.000497 ***
## race_AfricanAmerican                           4.49e-05 ***
## race_Hispanic                                  0.109135    
## admission_type_id_Emergency                    0.006866 ** 
## admission_type_id_Urgent                       0.006845 ** 
## admission_type_id_Elective                     0.009063 ** 
## discharge_disposition_id_DcHome                 < 2e-16 ***
## discharge_disposition_id_DcOtherFacility        < 2e-16 ***
## discharge_disposition_id_DcHomeWithHomeService  < 2e-16 ***
## diag_1_390.459                                 6.64e-11 ***
## diag_1_250.0.250.9                             1.84e-06 ***
## diag_1_460.519                                 2.95e-08 ***
## diag_1_710.739                                 1.39e-12 ***
## diag_1_780.799                                  < 2e-16 ***
## diag_2_390.459                                 0.000232 ***
## diag_2_Other                                   0.000260 ***
## diag_2_250.0.250.9                              < 2e-16 ***
## diag_2_460.519                                 0.025385 *  
## diag_3_240.279.not250.                         4.86e-09 ***
## diag_3_250.0.250.9                             3.63e-10 ***
## metformin_Down                                 2.54e-06 ***
## metformin_No                                   3.88e-06 ***
## metformin_Steady                               0.000200 ***
## glimepiride_No                                 0.002188 ** 
## glimepiride_Steady                             3.39e-05 ***
## glipizide_Down                                 0.000487 ***
## glyburide_Down                                 0.036786 *  
## rosiglitazone_Down                             0.024001 *  
## insulin_Down                                   0.000240 ***
## insulin_Steady                                 4.62e-05 ***
## change                                         2.31e-05 ***
## diabetesMed                                     < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 159187  on 114828  degrees of freedom
## Residual deviance: 151593  on 114789  degrees of freedom
## AIC: 151673
## 
## Number of Fisher Scoring iterations: 4
lr3prob <- predict(lr3, db.lr3[db.lr3[,41] == 1, 1:39])
lr3predict <- ifelse(lr3prob>=0.5, 1, 0)
confusionMatrix(as.factor(lr3predict), db.lr3[db.lr3[,41] == 1, 40], mode="prec_recall", positive = "1") # Accuracy is 0.5771, which is very slightly lower than the last model. Recall is 0.2761, which is slightly lower as well, but not by much
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5601 4627
##          1  770 1765
##                                           
##                Accuracy : 0.5771          
##                  95% CI : (0.5685, 0.5857)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1551          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6963          
##                  Recall : 0.2761          
##                      F1 : 0.3954          
##              Prevalence : 0.5008          
##          Detection Rate : 0.1383          
##    Detection Prevalence : 0.1986          
##       Balanced Accuracy : 0.5776          
##                                           
##        'Positive' Class : 1               
## 
lr3roc <- roc(as.numeric(db.lr3[db.lr3[,41] == 1, 40]), as.numeric(lr3predict))
auc(lr3roc) # AUC = 0.5776, which is also very slightly lower than the last model
## Area under the curve: 0.5776

Since Test Models 2 and 3 perform so similarly, and it’s computationally feasible to run both through 10-fold cross-validation, let’s run both and see which performs better overall.

Logistic Regression Decide on Final Model

# Create two dataframes, initialized with zeros, to house the evaluation measures for each model
accuracy <- rep(0, times = 10)
precision <- rep(0, times = 10)
recall <- rep(0, times = 10)
f1 <- rep(0, times = 10)
auc <- rep(0, times = 10)
eval.lr2 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))
eval.lr3 <- as.data.frame(cbind(accuracy, precision, recall, f1, auc))

# Run algorithm for each group of folds for Model 2; store evaluation metrics for each fold in a dataframe (eval.lr2)
for (f in 1:10){
  lr2class <- glm(formula = readmitted ~ ., data = db.lr2[db.lr2[,68] != f, 1:67], family = 'binomial')
  lr2prob <- predict(lr2class, db.lr2[db.lr2[,68] == f, 1:66])
  lr2predict <- ifelse(lr2prob>=0.5, 1, 0)
  lr2result <- confusionMatrix(as.factor(lr2predict), db.lr2[db.lr2[,68] == f, 67], mode="prec_recall", positive = "1")
  eval.lr2$accuracy[f] <- lr2result$overall['Accuracy']
  eval.lr2$precision[f] <- lr2result$byClass['Precision']
  eval.lr2$recall[f] <- lr2result$byClass['Recall']
  eval.lr2$f1[f] <- lr2result$byClass['F1']
  lr2roc <- roc(as.numeric(db.lr2[db.lr2[,68] == f, 67]), as.numeric(lr2predict))
  eval.lr2$auc[f] <- auc(lr2roc)
} 

# Evaluate Model 2
eval.lr2
##     accuracy precision    recall        f1       auc
## 1  0.5778422 0.6984190 0.2764393 0.3960995 0.5783389
## 2  0.5688996 0.6855006 0.2569693 0.3738106 0.5693625
## 3  0.5797373 0.7008181 0.2807866 0.4009360 0.5802524
## 4  0.5673287 0.6837895 0.2535123 0.3698895 0.5679188
## 5  0.5690275 0.6865609 0.2567504 0.3737362 0.5695655
## 6  0.5510797 0.6444834 0.2308295 0.3399146 0.5515582
## 7  0.5737615 0.6921831 0.2679523 0.3863457 0.5742183
## 8  0.5720682 0.7017849 0.2528627 0.3717712 0.5725453
## 9  0.5546709 0.6528384 0.2354702 0.3461049 0.5549983
## 10 0.5745332 0.6989648 0.2644110 0.3836800 0.5750694
summary(eval.lr2) # Mean Accuracy = 0.5689, mean Precision = 0.6845, mean Recall = 0.2576, mean f1 = 0.3742, mean AUC = 0.5694
##     accuracy        precision          recall             f1        
##  Min.   :0.5511   Min.   :0.6445   Min.   :0.2308   Min.   :0.3399  
##  1st Qu.:0.5677   1st Qu.:0.6842   1st Qu.:0.2530   1st Qu.:0.3704  
##  Median :0.5705   Median :0.6894   Median :0.2569   Median :0.3738  
##  Mean   :0.5689   Mean   :0.6845   Mean   :0.2576   Mean   :0.3742  
##  3rd Qu.:0.5743   3rd Qu.:0.6988   3rd Qu.:0.2671   3rd Qu.:0.3857  
##  Max.   :0.5797   Max.   :0.7018   Max.   :0.2808   Max.   :0.4009  
##       auc        
##  Min.   :0.5516  
##  1st Qu.:0.5683  
##  Median :0.5711  
##  Mean   :0.5694  
##  3rd Qu.:0.5749  
##  Max.   :0.5803
# Run algorithm for each group of folds for Model 3; store evaluation metrics for each fold in a dataframe (eval.lr3)
for (f in 1:10){
  lr3class <- glm(formula = readmitted ~ ., data = db.lr3[db.lr3[,41] != f, 1:40], family = 'binomial')
  lr3prob <- predict(lr3class, db.lr3[db.lr3[,41] == f, 1:39])
  lr3predict <- ifelse(lr3prob>=0.5, 1, 0)
  lr3result <- confusionMatrix(as.factor(lr3predict), db.lr3[db.lr3[,41] == f, 40], mode="prec_recall", positive = "1")
  eval.lr3$accuracy[f] <- lr3result$overall['Accuracy']
  eval.lr3$precision[f] <- lr3result$byClass['Precision']
  eval.lr3$recall[f] <- lr3result$byClass['Recall']
  eval.lr3$f1[f] <- lr3result$byClass['F1']
  lr3roc <- roc(as.numeric(db.lr3[db.lr3[,41] == f, 40]), as.numeric(lr3predict))
  eval.lr3$auc[f] <- auc(lr3roc)
} 

# Evaluate Model 3
eval.lr3
##     accuracy precision    recall        f1       auc
## 1  0.5771370 0.6962525 0.2761264 0.3954296 0.5776331
## 2  0.5664821 0.6841880 0.2493381 0.3654834 0.5669527
## 3  0.5767667 0.6960126 0.2751678 0.3944072 0.5772863
## 4  0.5713169 0.6925000 0.2594443 0.3774699 0.5719033
## 5  0.5668386 0.6817800 0.2534728 0.3695529 0.5673785
## 6  0.5564193 0.6541067 0.2422769 0.3535874 0.5568887
## 7  0.5742326 0.6924627 0.2693634 0.3878542 0.5746880
## 8  0.5712041 0.6989574 0.2523922 0.3708655 0.5716807
## 9  0.5530154 0.6474164 0.2348401 0.3446602 0.5533418
## 10 0.5723364 0.6980892 0.2575188 0.3762444 0.5728807
summary(eval.lr3) # Mean Accuracy = 0.5686, mean Precision = 0.6842, mean Recall = 0.2570, mean f1 = 0.3736, mean AUC = 0.5691; values are all slightly lower here than for Model 2
##     accuracy        precision          recall             f1        
##  Min.   :0.5530   Min.   :0.6474   Min.   :0.2348   Min.   :0.3447  
##  1st Qu.:0.5666   1st Qu.:0.6824   1st Qu.:0.2501   1st Qu.:0.3665  
##  Median :0.5713   Median :0.6925   Median :0.2555   Median :0.3736  
##  Mean   :0.5686   Mean   :0.6842   Mean   :0.2570   Mean   :0.3736  
##  3rd Qu.:0.5738   3rd Qu.:0.6962   3rd Qu.:0.2669   3rd Qu.:0.3853  
##  Max.   :0.5771   Max.   :0.6990   Max.   :0.2761   Max.   :0.3954  
##       auc        
##  Min.   :0.5533  
##  1st Qu.:0.5671  
##  Median :0.5718  
##  Mean   :0.5691  
##  3rd Qu.:0.5742  
##  Max.   :0.5776

The evaluation metrics are all slightly higher for Model 2 than for Model 3, so we will take Model 2 as the final logistic regression model for comparison.

Evaluation between Models

Before running any statistical tests to compare models, we need to determine whether each metric’s distribution over the 10 folds is normal, since the sample of paired observations per model is only 10 (one per fold). We will therefore apply the Shapiro-Wilk test of normality to each distribution. Note that with the Shapiro-Wilk test a p-value above alpha (0.05) means we cannot reject the null hypothesis that the distribution is normal; a p-value below alpha indicates the distribution is likely not normal.
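The choice of paired test then follows from these normality checks. A sketch comparing one metric (AUC) between two of the models over the same folds (strictly, the paired t-test assumes normality of the fold-wise differences, so the sketch tests the differences directly):

```r
# Paired comparison of AUC over the same 10 folds; test choice depends on
# whether the fold-wise differences look normal (Shapiro-Wilk on differences)
d <- eval.lr2$auc - eval.rf$auc
if (shapiro.test(d)$p.value > 0.05) {
  t.test(eval.lr2$auc, eval.rf$auc, paired = TRUE)       # paired t-test
} else {
  wilcox.test(eval.lr2$auc, eval.rf$auc, paired = TRUE)  # Wilcoxon signed-rank
}
```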

sapply(eval38[,1:5], shapiro.test) # p-values all greater than alpha, so we cannot reject normality
##           accuracy                      precision                    
## statistic 0.9565324                     0.9462429                    
## p.value   0.7457268                     0.6243126                    
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           recall                        f1                           
## statistic 0.9570563                     0.9435931                    
## p.value   0.7518574                     0.5936412                    
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           auc                          
## statistic 0.95582                      
## p.value   0.7373636                    
## method    "Shapiro-Wilk normality test"
## data.name "X[[i]]"
sapply(eval.knn[,1:5], shapiro.test) # p-values all greater than alpha, so we cannot reject normality
##           accuracy                      precision                    
## statistic 0.9373673                     0.9503424                    
## p.value   0.5241075                     0.6725528                    
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           recall                        f1                           
## statistic 0.906371                      0.9335421                    
## p.value   0.2569821                     0.4836223                    
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           auc                          
## statistic 0.9375499                    
## p.value   0.5260868                    
## method    "Shapiro-Wilk normality test"
## data.name "X[[i]]"
sapply(eval.rf[,1:5], shapiro.test) # p-values all greater than alpha, so we cannot reject normality
##           accuracy                      precision                    
## statistic 0.9330593                     0.9642497                    
## p.value   0.4786489                     0.8330225                    
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           recall                        f1                           
## statistic 0.8808684                     0.8726872                    
## p.value   0.1335223                     0.1074285                    
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           auc                          
## statistic 0.93312                      
## p.value   0.4792721                    
## method    "Shapiro-Wilk normality test"
## data.name "X[[i]]"
sapply(eval.lr2[,1:5], shapiro.test) # The p-value for precision (p=0.01213728) is below alpha, so we reject the null hypothesis of normality and treat that distribution as non-normal
##           accuracy                      precision                    
## statistic 0.8825805                     0.7936504                    
## p.value   0.1396872                     0.01213728                   
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           recall                        f1                           
## statistic 0.9520318                     0.9313017                    
## p.value   0.6925719                     0.460817                     
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"                     
##           auc                          
## statistic 0.8805073                    
## p.value   0.1322551                    
## method    "Shapiro-Wilk normality test"
## data.name "X[[i]]"

All of the distributions appear normal except the precision values for the logistic regression model. Because each sample is small (n=10), we cannot rely on the central limit theorem, so we will use parametric statistical tests for accuracy, recall, f1, and AUC, and non-parametric statistical tests for precision.
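
As a side note, the full htest printouts above can be condensed to just the p-values. A minimal self-contained sketch on simulated metric columns (the column names and values below are illustrative only, not the study's results):

```r
# Sketch: extract only the Shapiro-Wilk p-values, one per metric column.
# Simulated data for illustration only.
set.seed(1)
ev.sim <- data.frame(accuracy = rnorm(10, 0.90, 0.01),
                     recall   = rnorm(10, 0.55, 0.05))
round(sapply(ev.sim, function(x) shapiro.test(x)$p.value), 4)
```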

We will run the block design two-way ANOVA (parametric) or Friedman (non-parametric) tests for each type of evaluation metric (accuracy, precision, recall, f1, AUC). If we find that any of the means (parametric) or medians (non-parametric) are different between distributions, we will run a pairwise test to identify which differ (pairwise paired t-test for parametric, pairwise Wilcoxon signed rank test for non-parametric).

Before running any of these tests, we need to format the data accordingly.

# Put all of each type of metric together in one vector
acc <- c(eval38$accuracy, eval.knn$accuracy, eval.rf$accuracy, eval.lr2$accuracy) 
prec <- c(eval38$precision, eval.knn$precision, eval.rf$precision, eval.lr2$precision)
rec <- c(eval38$recall, eval.knn$recall, eval.rf$recall, eval.lr2$recall)
f1 <- c(eval38$f1, eval.knn$f1, eval.rf$f1, eval.lr2$f1)
auc <- c(eval38$auc, eval.knn$auc, eval.rf$auc, eval.lr2$auc)

# Assign each group (model) to the applicable values as a vector
model <- c(rep("nn", 10), rep("knn", 10), rep("rf", 10), rep("lr", 10)) # nn = neural network, knn = k nearest neighbours, rf = random forest, lr = logistic regression

# Assign the folds as a variable, as we need to keep track of these since they make the data paired
folds <- rep(1:10, times = 4) 

# Combine the evaluation metric, model, and folds into each data frame for analysis
eval.acc <- data.frame(acc, model, folds) 
eval.prec <- data.frame(prec, model, folds)
eval.rec <- data.frame(rec, model, folds)
eval.f1 <- data.frame(f1, model, folds)
eval.auc <- data.frame(auc, model, folds)

Let’s visualize the boxplots for each evaluation metric by model.

plot(acc ~ model, data=eval.acc, main = "Accuracy by Model", ylab = "Accuracy", xlab = "Model") # Distribution appears lowest for knn and highest for random forest; the neural network is close to random forest, though with a slightly lower mean and less variability

plot(prec ~ model, data=eval.prec, main = "Precision by Model", ylab = "Precision", xlab = "Model") # Precision appears much higher for logistic regression, with knn as the lowest. Random forest and neural network are similar.

plot(rec ~ model, data=eval.rec, main = "Recall by Model", ylab = "Recall", xlab = "Model") # Highest for neural network, though hard to tell by how much with the scale here. Recall for logistic regression is very low.

plot(f1 ~ model, data=eval.f1, main = "F1 Score by Model", ylab = 'F1 Score', xlab = 'Model') # Highest for neural network here, with logistic regression f1 being very low (due to very low recall)

plot(auc ~ model, data=eval.auc, main = "Area under the ROC Curve (AUC) by Model", ylab = "AUC", xlab = 'Model') # Highest for random forest, with the neural network following closely behind; knn is the lowest

Now we may run the initial statistical tests for each evaluation metric (two-way ANOVA or Friedman).

# ANOVA for Accuracy
acc.model <- lm(acc ~ model + folds, data = eval.acc)
anova(acc.model) # Highly significant: at least one model's mean accuracy differs from the others (p=2.424e-14)
## Analysis of Variance Table
## 
## Response: acc
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## model      3 0.0229342 0.0076447 64.5525 2.424e-14 ***
## folds      1 0.0002104 0.0002104  1.7768    0.1912    
## Residuals 35 0.0041449 0.0001184                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Friedman for Precision
friedman.test(prec ~ model | folds, data=eval.prec) # Highly significant: at least one model's median precision differs from the others (p=3.494e-06)
## 
##  Friedman rank sum test
## 
## data:  prec and model and folds
## Friedman chi-squared = 28.08, df = 3, p-value = 3.494e-06
# ANOVA for Recall
rec.model <- lm(rec ~ model+folds, data=eval.rec)
anova(rec.model) # Highly significant: at least one model's mean recall differs from the others (p<2e-16)
## Analysis of Variance Table
## 
## Response: rec
##           Df  Sum Sq  Mean Sq  F value Pr(>F)    
## model      3 0.67231 0.224105 509.6068 <2e-16 ***
## folds      1 0.00107 0.001072   2.4375 0.1275    
## Residuals 35 0.01539 0.000440                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA for f1
f1.model <- lm(f1 ~ model+folds, data=eval.f1)
anova(f1.model) # Highly significant: at least one model's mean f1 score differs from the others (p<2e-16)
## Analysis of Variance Table
## 
## Response: f1
##           Df   Sum Sq  Mean Sq  F value  Pr(>F)    
## model      3 0.309348 0.103116 407.7888 < 2e-16 ***
## folds      1 0.000777 0.000777   3.0733 0.08834 .  
## Residuals 35 0.008850 0.000253                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA for AUC
auc.model <- lm(auc ~ model+folds, data=eval.auc)
anova(auc.model) # Highly significant: at least one model's mean AUC differs from the others (p=2.587e-14)
## Analysis of Variance Table
## 
## Response: auc
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## model      3 0.0228624 0.0076208 64.2680 2.587e-14 ***
## folds      1 0.0002113 0.0002113  1.7816    0.1906    
## Residuals 35 0.0041502 0.0001186                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
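
One caveat: the `lm` calls above enter `folds` as a numeric covariate (1 df in each table). In a randomized block design, the fold is usually coded as a blocking factor, which would use 9 df for 10 folds; wrapping it in `factor()` checks whether that changes anything. A self-contained sketch on simulated data (names and values are illustrative only):

```r
# Sketch: fold as a blocking factor rather than a numeric covariate.
# Simulated data for illustration; not the study's results.
set.seed(2)
ev.sim <- data.frame(metric = rnorm(40, rep(c(0.60, 0.58, 0.62, 0.59), each = 10), 0.01),
                     model  = rep(c("nn", "knn", "rf", "lr"), each = 10),
                     folds  = rep(1:10, times = 4))
tab <- anova(lm(metric ~ model + factor(folds), data = ev.sim))
tab["factor(folds)", "Df"] # 9 block degrees of freedom instead of 1
```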

Since each of the five evaluation metrics showed a statistically significant difference in at least one mean or median between models, we will perform pairwise two-sided tests for all of them to identify where the differences lie.

# Pairwise Paired Two-Sided T-Test for Accuracy
pairwise.t.test(eval.acc$acc, eval.acc$model, p.adjust.method="bonferroni", paired = TRUE) # Accuracy for the neural network and random forest is not significantly different (p=1.0); recall from the boxplots that these two had the highest accuracy. All other pairings differ significantly in accuracy.
## 
##  Pairwise comparisons using paired t tests 
## 
## data:  eval.acc$acc and eval.acc$model 
## 
##    knn     lr      nn    
## lr 0.0018  -       -     
## nn 2.0e-06 7.2e-07 -     
## rf 7.3e-08 7.4e-08 1.0000
## 
## P value adjustment method: bonferroni
# Pairwise Two-Sided Wilcoxon Signed Rank Test for Precision
pairwise.wilcox.test(eval.prec$prec, eval.prec$model, p.adjust.method = "bonferroni", paired = TRUE) # Precision for the neural network and random forest is not significantly different (p=0.223). All other pairings differ significantly. Logistic regression had the highest precision (likely at the expense of recall).
## 
##  Pairwise comparisons using Wilcoxon signed rank test 
## 
## data:  eval.prec$prec and eval.prec$model 
## 
##    knn   lr    nn   
## lr 0.012 -     -    
## nn 0.012 0.012 -    
## rf 0.012 0.012 0.223
## 
## P value adjustment method: bonferroni
# Pairwise Paired Two-Sided T-Test for Recall
pairwise.t.test(eval.rec$rec, eval.rec$model, p.adjust.method="bonferroni", paired = TRUE) # Neural network and random forest (the two highest on the boxplot) are not significantly different from one another (p=0.432). Interestingly, random forest and knn (3rd highest) are not significantly different from one another (p=0.090), while the neural network and knn are significantly different from one another at an alpha = 0.05 level (p=0.044). All other pairings are significantly different.
## 
##  Pairwise comparisons using paired t tests 
## 
## data:  eval.rec$rec and eval.rec$model 
## 
##    knn     lr      nn   
## lr 1.8e-10 -       -    
## nn 0.044   1.7e-09 -    
## rf 0.090   2.8e-14 0.432
## 
## P value adjustment method: bonferroni
# Pairwise Paired Two-Sided T-Test for F1 Score
pairwise.t.test(eval.f1$f1, eval.f1$model, p.adjust.method="bonferroni", paired = TRUE) # The f1 scores for the neural network and random forest are not significantly different (p=1.0); these two have the highest f1 scores, as is visible in the boxplot. All other pairings differ significantly in f1 score.
## 
##  Pairwise comparisons using paired t tests 
## 
## data:  eval.f1$f1 and eval.f1$model 
## 
##    knn     lr      nn     
## lr 3.5e-09 -       -      
## nn 0.00028 1.4e-09 -      
## rf 2.8e-05 8.0e-13 1.00000
## 
## P value adjustment method: bonferroni
# Pairwise Paired Two-Sided T-Test for AUC
pairwise.t.test(eval.auc$auc, eval.auc$model, p.adjust.method="bonferroni", paired = TRUE) # AUC for the neural network and random forest is not significantly different (p=1.0); these two had the highest AUC, as is visible in the boxplots. All other pairings differ significantly in AUC.
## 
##  Pairwise comparisons using paired t tests 
## 
## data:  eval.auc$auc and eval.auc$model 
## 
##    knn     lr      nn    
## lr 0.0015  -       -     
## nn 1.9e-06 8.3e-07 -     
## rf 7.2e-08 7.9e-08 1.0000
## 
## P value adjustment method: bonferroni
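
For reference, the Bonferroni adjustment used in these pairwise tests simply multiplies each raw p-value by the number of comparisons and caps the result at 1. A minimal sketch with illustrative p-values:

```r
# Sketch: Bonferroni adjustment as applied by pairwise.t.test / pairwise.wilcox.test.
p.raw <- c(0.004, 0.020, 0.400)        # illustrative raw p-values
p.adjust(p.raw, method = "bonferroni") # 0.012 0.060 1.000
```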

The neural network and random forest models had the highest values on every evaluation metric except precision, where logistic regression was highest. However, logistic regression had very low recall overall (mean recall of 0.2576), removing it as a viable option for prediction. The neural network and random forest were not significantly different from one another on any of the five evaluation metrics. One difference, however, is that for recall, the random forest model was not significantly different from the model with the next lowest score (knn), while the neural network was significantly different from knn. The neural network also had a higher mean recall than the random forest.

We will move forward with both the optimized neural network model and the random forest model to determine variable importance, since they performed similarly. This will allow us to compare and contrast the variables of importance in each model.

Variable Importance

Neural Network Model

Now that we have evaluated the models, we will use the optimized neural network model to identify the variables with the highest importance in model outcomes. We will do this by systematically removing each variable and measuring the effect of its removal on the AUC metric. AUC was chosen because it performed very similarly to accuracy here and was used in most of the papers reviewed in the literature review, making comparisons easier.

imp.auc <- rep(0, 81) # Create empty vector to house AUC for each missing variable
allvariables <- names(db.over) # Create object with variable names to use for evaluation after
variables <- allvariables[1:81] # Exclude variable names for 'readmitted' and 'folds'

for (i in 1:81) {
  
  # Initiate
  model.imp <- keras_model_sequential()
  
  # Layers
  model.imp %>%
    layer_dropout(rate = 0.5, input_shape = c(80)) %>% # One fewer input (80 instead of 81), since one variable is removed in each iteration
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 1, activation = 'sigmoid')

# Compile
  model.imp %>% compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = 'accuracy'
  )

  # Train
  model.imp %>% fit(
    db[db[,83] != 1,  -c(i, 82, 83)], 
    db[db[,83] != 1, 82], 
    epochs = 50, 
    batch_size = 128,
    verbose = 0
  )

  # Predict classes for test data
  mod.imp.class <- model.imp %>% predict_classes(db[db[,83] == 1, -c(i, 82, 83)], batch_size = 128)

  # Evaluation
  mod.imp.roc <- roc(as.numeric(db[db[,83] == 1, 82]), as.numeric(mod.imp.class))
  imp.auc[i] <- as.numeric(auc(mod.imp.roc))
}

var.imp <- data.frame(imp.auc, variables) # Create data frame with AUC values and corresponding variable that's missing in that iteration
var.imp$diff.auc <- 0.6063378 - var.imp$imp.auc # Create a derived variable of the AUC for the optimized neural network on the same fold minus the AUC obtained for each iteration (difference in AUC)
var.imp <- var.imp[order(-var.imp$diff.auc),] # Order the rows by diff.auc in descending order
var.imp # Review findings 
##      imp.auc                                      variables      diff.auc
## 7  0.5940439                               number_inpatient  0.0122938599
## 24 0.5982857       discharge_disposition_id_DcOtherFacility  0.0080520759
## 15 0.5990588                                     race_Asian  0.0072789707
## 70 0.5990886                            pioglitazone_Steady  0.0072492212
## 26 0.5993550 discharge_disposition_id_DcHomeWithHomeService  0.0069828390
## 1  0.6006957                               time_in_hospital  0.0056420875
## 78 0.6012359                                 insulin_Steady  0.0051018826
## 47 0.6023248                                   diag_3_Other  0.0040129517
## 62 0.6027523                               glipizide_Steady  0.0035854829
## 65 0.6027748                                   glyburide_No  0.0035629528
## 3  0.6028479                                 num_procedures  0.0034898868
## 49 0.6028875                             diag_3_250.0.250.9  0.0034503395
## 74 0.6030014                           rosiglitazone_Steady  0.0033364371
## 52 0.6043274                                 metformin_Down  0.0020103578
## 25 0.6048650                 discharge_disposition_id_DcSNF  0.0014727557
## 16 0.6048992                                  race_Hispanic  0.0014385616
## 36 0.6050523                                 diag_1_520.579  0.0012854680
## 51 0.6051496                                 diag_3_580.629  0.0011881653
## 20 0.6051918                       admission_type_id_Urgent  0.0011460396
## 66 0.6064173                               glyburide_Steady -0.0000794833
## 4  0.6067629                                num_medications -0.0004251197
## 68 0.6068058                              pioglitazone_Down -0.0004680188
## 75 0.6069505                               rosiglitazone_Up -0.0006127021
## 43 0.6071214                             diag_2_250.0.250.9 -0.0007835865
## 29 0.6071491        admission_source_id_TransferExtFacility -0.0008113347
## 32 0.6075352                                 diag_1_390.459 -0.0011973901
## 34 0.6077870                             diag_1_250.0.250.9 -0.0014492352
## 27 0.6079642                 discharge_disposition_id_Other -0.0016264060
## 80 0.6080882                                         change -0.0017503641
## 2  0.6085198                             num_lab_procedures -0.0021819584
## 28 0.6086275                    admission_source_id_PhysRef -0.0022896728
## 23 0.6087274                discharge_disposition_id_DcHome -0.0023895538
## 38 0.6088513                                 diag_1_780.799 -0.0025135119
## 46 0.6093792                                 diag_3_390.459 -0.0030413776
## 72 0.6094014                             rosiglitazone_Down -0.0030635515
## 21 0.6094135                     admission_type_id_Elective -0.0030756699
## 19 0.6094274                    admission_type_id_Emergency -0.0030895931
## 73 0.6096316                               rosiglitazone_No -0.0032938366
## 67 0.6097001                                   glyburide_Up -0.0033622616
## 14 0.6097540                           race_AfricanAmerican -0.0034162477
## 71 0.6102025                                pioglitazone_Up -0.0038646996
## 5  0.6102052                              number_outpatient -0.0038674376
## 13 0.6102663                                 race_Caucasian -0.0039284835
## 76 0.6103295                                   insulin_Down -0.0039916536
## 45 0.6104761                                 diag_2_580.629 -0.0041383013
## 37 0.6105187                                 diag_1_710.739 -0.0041809426
## 6  0.6108432                               number_emergency -0.0045054364
## 44 0.6108446                                 diag_2_460.519 -0.0045068238
## 64 0.6111555                                 glyburide_Down -0.0048177136
## 30 0.6112433                         admission_source_id_ER -0.0049054764
## 60 0.6113846                                 glipizide_Down -0.0050468078
## 40 0.6118399                                 diag_2_390.459 -0.0055021230
## 42 0.6119143                         diag_2_240.279.not250. -0.0055764782
## 50 0.6121413                                 diag_3_460.519 -0.0058035097
## 41 0.6121447                                   diag_2_Other -0.0058068616
## 61 0.6122910                                   glipizide_No -0.0059531533
## 81 0.6126121                                    diabetesMed -0.0062742952
## 48 0.6126346                         diag_3_240.279.not250. -0.0062968252
## 79 0.6128290                                     insulin_Up -0.0064911728
## 33 0.6129020                                   diag_1_Other -0.0065642388
## 59 0.6131411                                 glimepiride_Up -0.0068032904
## 12 0.6135820                                 A1Cresult_Norm -0.0072441668
## 11 0.6136814                                 A1Cresult_None -0.0073436304
## 63 0.6136849                                   glipizide_Up -0.0073471419
## 53 0.6139939                                   metformin_No -0.0076561041
## 18 0.6140193                                         gender -0.0076815317
## 54 0.6140968                               metformin_Steady -0.0077589809
## 8  0.6144581                               number_diagnoses -0.0081202840
## 10 0.6150165                                   A1Cresult_.7 -0.0086786726
## 56 0.6151436                               glimepiride_Down -0.0088058230
## 35 0.6154375                                 diag_1_460.519 -0.0090996955
## 55 0.6154509                                   metformin_Up -0.0091131031
## 17 0.6157818                                     race_Other -0.0094440427
## 9  0.6158897                                            age -0.0095519168
## 69 0.6166366                                pioglitazone_No -0.0102988209
## 22 0.6167714                        admission_type_id_Other -0.0104336082
## 77 0.6171305                                     insulin_No -0.0107926521
## 58 0.6174351                             glimepiride_Steady -0.0110972924
## 31 0.6176560                      admission_source_id_Other -0.0113181972
## 39 0.6176716                                 diag_1_800.999 -0.0113337656
## 57 0.6186779                                 glimepiride_No -0.0123400904
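
Since ggplot2 is already loaded, the variables whose removal most reduced AUC can be shown as a sorted bar chart. A sketch using the top four rows of the var.imp table above (diff.auc values rounded):

```r
# Sketch: bar chart of the variables whose removal most reduced AUC.
# Values are copied (rounded) from the top of the var.imp table above.
library(ggplot2)
top.imp <- data.frame(
  variable = c("number_inpatient", "discharge_disposition_id_DcOtherFacility",
               "race_Asian", "pioglitazone_Steady"),
  diff.auc = c(0.01229, 0.00805, 0.00728, 0.00725))
ggplot(top.imp, aes(x = reorder(variable, diff.auc), y = diff.auc)) +
  geom_col() +
  coord_flip() + # Horizontal bars so the long variable names stay readable
  labs(x = NULL, y = "Decrease in AUC when variable removed")
```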

Random Forest Model

Let’s compare this to the variable importance reported by the final random forest model, since the two models performed similarly. This can be done while re-running the model by setting the ‘importance’ argument to TRUE.

rf.var.imp <- randomForest(readmitted ~ ., data=db.rf[db.rf[,83] != 1, 1:82], ntree=100, mtry=6, nodesize = 500, maxnodes = 17000, importance=T) # Re-run the model on the same folds as the neural network above
rf.imp <- importance(rf.var.imp) # Save the importance output to an object
rf.imp[order(-rf.imp[,3]),] # Sort by mean decrease in accuracy (descending). The highest mean decrease in accuracy is number_inpatient (the same as for the neural network). Other variables in the top 10 here that are also in the neural network's top 10 are time_in_hospital and discharge_disposition_id_DcOtherFacility
##                                                         0           1
## number_inpatient                               24.1900849 23.84230428
## age                                            10.9869512 15.82734589
## num_medications                                 2.1930784 17.62458147
## num_lab_procedures                              7.7131634 19.83301285
## time_in_hospital                               10.3663535 13.78594115
## discharge_disposition_id_DcHome                14.5513560  9.45339930
## number_diagnoses                                5.5836536 16.06915224
## discharge_disposition_id_DcOtherFacility        7.7824707 15.51043757
## num_procedures                                  5.9925052 15.05838793
## number_emergency                                8.8039522 12.65196733
## race_Other                                      0.5002715 13.69468841
## diag_2_580.629                                  3.7079860 13.07617459
## race_Hispanic                                   0.7543741 12.11099998
## discharge_disposition_id_DcSNF                  8.6857934  7.94463198
## glimepiride_Steady                              0.2010008 10.62961100
## diag_3_250.0.250.9                              3.8537883 10.57514157
## number_outpatient                               7.9799062 11.03609641
## diag_3_580.629                                  2.7604690 10.73769356
## glimepiride_No                                  1.1822328 11.08603385
## diag_2_250.0.250.9                              0.9073905 10.79213258
## admission_type_id_Urgent                        0.9326043  9.51224145
## diag_2_460.519                                  4.5554826  9.65229697
## diag_1_250.0.250.9                              3.3086752  9.78208328
## diag_1_Other                                    3.1728715 10.57292323
## admission_source_id_TransferExtFacility         3.5172049  9.96513533
## A1Cresult_Norm                                  1.9424454 10.51332225
## diag_2_240.279.not250.                          3.1160765 10.62532706
## glipizide_Up                                    3.5945204 10.02361035
## rosiglitazone_No                                0.4713427  9.93579835
## glyburide_Up                                    0.9813711 10.09813597
## metformin_Up                                    3.4344588  9.31511671
## race_AfricanAmerican                           -2.6590198  9.42416972
## diag_1_710.739                                  3.8082891  8.82863050
## diag_3_390.459                                  4.3078870  9.60556662
## gender                                         -1.6405112 10.65479000
## discharge_disposition_id_DcHomeWithHomeService  9.3703155  0.07048686
## pioglitazone_No                                 3.5826837  9.49094191
## glyburide_Down                                  3.8722237 10.06373963
## diag_1_780.799                                  6.6999427  8.20058470
## diag_1_390.459                                  3.9865531 10.45544730
## diabetesMed                                     5.3759861  6.17055013
## diag_1_460.519                                  6.2207083  8.16626329
## pioglitazone_Steady                             2.3744028 10.45148702
## glipizide_Down                                  1.7192820 10.36994975
## race_Asian                                     -2.3891635 11.58928858
## diag_3_460.519                                  6.2394695 10.18699950
## insulin_Steady                                  4.0652623  8.16256922
## metformin_No                                    2.3301048  7.50052417
## metformin_Down                                 -2.1648139  9.51139297
## glyburide_Steady                               -2.9126503  8.89659338
## A1Cresult_.7                                    2.1875691  8.95404729
## admission_type_id_Emergency                     2.4751330  7.94639358
## diag_3_240.279.not250.                          2.1663534  9.26874224
## admission_type_id_Elective                      1.5634811  7.39258448
## diag_1_520.579                                 -0.7315720  8.61640853
## admission_source_id_ER                          1.0403159  6.89356636
## glipizide_Steady                                2.7260345  6.93221814
## diag_2_Other                                   -1.1201704  8.95944544
## rosiglitazone_Steady                            0.4822536  8.36262094
## glimepiride_Up                                 -1.4319494  9.61702877
## diag_1_800.999                                  2.9348281  7.77344775
## glyburide_No                                   -1.5763772  7.08590007
## A1Cresult_None                                  2.2313368  7.51412348
## glimepiride_Down                                2.0605294  9.02869980
## race_Caucasian                                 -1.3295215  8.83157824
## pioglitazone_Up                                 3.7708047  7.14507857
## diag_3_Other                                    2.5013194  7.50148173
## insulin_Down                                    2.2209495  7.96595427
## insulin_No                                      2.8134527  7.50820384
## discharge_disposition_id_Other                  3.9151434  3.36050938
## diag_2_390.459                                  3.2708757  6.96198870
## admission_source_id_PhysRef                     3.0586153  5.64758056
## metformin_Steady                                1.5986706  5.82515618
## glipizide_No                                    2.8875435  8.54984364
## insulin_Up                                      2.0570592  6.57913459
## change                                          3.6782085  6.16403807
## rosiglitazone_Up                               -0.3480543  6.38180999
## pioglitazone_Down                               0.6420784  4.90001087
## rosiglitazone_Down                             -1.8529818  4.53099557
## admission_type_id_Other                         1.2893268  3.24593437
## admission_source_id_Other                      -1.0050378  1.35466574
##                                                MeanDecreaseAccuracy
## number_inpatient                                         26.6896279
## age                                                      19.5011544
## num_medications                                          19.4635692
## num_lab_procedures                                       19.2680777
## time_in_hospital                                         17.6750999
## discharge_disposition_id_DcHome                          17.0970735
## number_diagnoses                                         16.5172479
## discharge_disposition_id_DcOtherFacility                 16.4977084
## num_procedures                                           15.3313954
## number_emergency                                         13.3902766
## race_Other                                               13.1939725
## diag_2_580.629                                           13.1316252
## race_Hispanic                                            12.9251025
## discharge_disposition_id_DcSNF                           12.8115110
## glimepiride_Steady                                       12.6715893
## diag_3_250.0.250.9                                       12.5406373
## number_outpatient                                        12.4159590
## diag_3_580.629                                           11.9447574
## glimepiride_No                                           11.6945507
## diag_2_250.0.250.9                                       11.6872575
## admission_type_id_Urgent                                 11.6712347
## diag_2_460.519                                           11.5126188
## diag_1_250.0.250.9                                       11.3083006
## diag_1_Other                                             11.2072570
## admission_source_id_TransferExtFacility                  11.0800950
## A1Cresult_Norm                                           11.0033683
## diag_2_240.279.not250.                                   10.9900738
## glipizide_Up                                             10.7804539
## rosiglitazone_No                                         10.7775083
## glyburide_Up                                             10.6538180
## metformin_Up                                             10.6435103
## race_AfricanAmerican                                     10.6099292
## diag_1_710.739                                           10.6050228
## diag_3_390.459                                           10.5919700
## gender                                                   10.5794799
## discharge_disposition_id_DcHomeWithHomeService           10.5441327
## pioglitazone_No                                          10.4982311
## glyburide_Down                                           10.4265774
## diag_1_780.799                                           10.3330510
## diag_1_390.459                                           10.2584996
## diabetesMed                                              10.1866187
## diag_1_460.519                                           10.1510994
## pioglitazone_Steady                                      10.1125507
## glipizide_Down                                            9.8066336
## race_Asian                                                9.8034308
## diag_3_460.519                                            9.7877419
## insulin_Steady                                            9.7549940
## metformin_No                                              9.6198653
## metformin_Down                                            9.5273982
## glyburide_Steady                                          9.3583392
## A1Cresult_.7                                              9.3373276
## admission_type_id_Emergency                               9.1628492
## diag_3_240.279.not250.                                    9.1161665
## admission_type_id_Elective                                9.0701260
## diag_1_520.579                                            8.6497403
## admission_source_id_ER                                    8.6262322
## glipizide_Steady                                          8.6232793
## diag_2_Other                                              8.6046700
## rosiglitazone_Steady                                      8.5779074
## glimepiride_Up                                            8.5131870
## diag_1_800.999                                            8.4326263
## glyburide_No                                              8.3883311
## A1Cresult_None                                            8.3510701
## glimepiride_Down                                          8.3417774
## race_Caucasian                                            8.2958087
## pioglitazone_Up                                           8.2031469
## diag_3_Other                                              8.1057916
## insulin_Down                                              7.9739692
## insulin_No                                                7.9692008
## discharge_disposition_id_Other                            7.8760492
## diag_2_390.459                                            7.7775343
## admission_source_id_PhysRef                               7.1320642
## metformin_Steady                                          7.1026725
## glipizide_No                                              6.8949536
## insulin_Up                                                6.6651456
## change                                                    6.6344365
## rosiglitazone_Up                                          5.7213388
## pioglitazone_Down                                         4.4501799
## rosiglitazone_Down                                        3.4429731
## admission_type_id_Other                                   3.1454255
## admission_source_id_Other                                 0.8182474
##                                                MeanDecreaseGini
## number_inpatient                                     621.512426
## age                                                  327.062205
## num_medications                                      384.631808
## num_lab_procedures                                   400.145651
## time_in_hospital                                     350.991229
## discharge_disposition_id_DcHome                      662.582270
## number_diagnoses                                     244.657062
## discharge_disposition_id_DcOtherFacility             444.016634
## num_procedures                                       148.578506
## number_emergency                                     125.905832
## race_Other                                            31.910177
## diag_2_580.629                                        38.439982
## race_Hispanic                                         30.895365
## discharge_disposition_id_DcSNF                       201.916319
## glimepiride_Steady                                    29.959662
## diag_3_250.0.250.9                                    44.329230
## number_outpatient                                    115.847108
## diag_3_580.629                                        37.358298
## glimepiride_No                                        31.173782
## diag_2_250.0.250.9                                    50.822775
## admission_type_id_Urgent                              38.295915
## diag_2_460.519                                        43.603508
## diag_1_250.0.250.9                                    47.983343
## diag_1_Other                                          47.284156
## admission_source_id_TransferExtFacility               37.866302
## A1Cresult_Norm                                        33.304157
## diag_2_240.279.not250.                                41.878723
## glipizide_Up                                          16.078325
## rosiglitazone_No                                      31.330688
## glyburide_Up                                          20.139535
## metformin_Up                                          23.259486
## race_AfricanAmerican                                  36.254638
## diag_1_710.739                                        71.296273
## diag_3_390.459                                        39.189726
## gender                                                43.346859
## discharge_disposition_id_DcHomeWithHomeService        85.428634
## pioglitazone_No                                       37.530527
## glyburide_Down                                        18.309285
## diag_1_780.799                                        66.688685
## diag_1_390.459                                        77.100161
## diabetesMed                                           65.227630
## diag_1_460.519                                        47.316170
## pioglitazone_Steady                                   32.844591
## glipizide_Down                                        19.333497
## race_Asian                                            19.380843
## diag_3_460.519                                        44.075098
## insulin_Steady                                        37.448930
## metformin_No                                          43.900171
## metformin_Down                                        19.201035
## glyburide_Steady                                      33.912885
## A1Cresult_.7                                          46.992268
## admission_type_id_Emergency                           43.488502
## diag_3_240.279.not250.                                45.583893
## admission_type_id_Elective                            48.545258
## diag_1_520.579                                        39.584783
## admission_source_id_ER                                38.597732
## glipizide_Steady                                      39.365237
## diag_2_Other                                          32.853050
## rosiglitazone_Steady                                  33.943044
## glimepiride_Up                                        13.040420
## diag_1_800.999                                        41.256242
## glyburide_No                                          33.138577
## A1Cresult_None                                        46.374683
## glimepiride_Down                                       9.788466
## race_Caucasian                                        37.162406
## pioglitazone_Up                                        9.391275
## diag_3_Other                                          39.551793
## insulin_Down                                          40.027178
## insulin_No                                            45.310808
## discharge_disposition_id_Other                        20.543998
## diag_2_390.459                                        40.726136
## admission_source_id_PhysRef                           34.548778
## metformin_Steady                                      35.259186
## glipizide_No                                          34.092538
## insulin_Up                                            28.267457
## change                                                35.181589
## rosiglitazone_Up                                       5.762974
## pioglitazone_Down                                      4.478572
## rosiglitazone_Down                                     2.845325
## admission_type_id_Other                                1.340301
## admission_source_id_Other                              0.454570
rf.imp[order(-rf.imp[,4]),] # Sort by mean decrease in Gini (descending). As with the neural network, the top 10 here include number_inpatient, discharge_disposition_id_DcOtherFacility, and time_in_hospital.
##                                                         0           1
## discharge_disposition_id_DcHome                14.5513560  9.45339930
## number_inpatient                               24.1900849 23.84230428
## discharge_disposition_id_DcOtherFacility        7.7824707 15.51043757
## num_lab_procedures                              7.7131634 19.83301285
## num_medications                                 2.1930784 17.62458147
## time_in_hospital                               10.3663535 13.78594115
## age                                            10.9869512 15.82734589
## number_diagnoses                                5.5836536 16.06915224
## discharge_disposition_id_DcSNF                  8.6857934  7.94463198
## num_procedures                                  5.9925052 15.05838793
## number_emergency                                8.8039522 12.65196733
## number_outpatient                               7.9799062 11.03609641
## discharge_disposition_id_DcHomeWithHomeService  9.3703155  0.07048686
## diag_1_390.459                                  3.9865531 10.45544730
## diag_1_710.739                                  3.8082891  8.82863050
## diag_1_780.799                                  6.6999427  8.20058470
## diabetesMed                                     5.3759861  6.17055013
## diag_2_250.0.250.9                              0.9073905 10.79213258
## admission_type_id_Elective                      1.5634811  7.39258448
## diag_1_250.0.250.9                              3.3086752  9.78208328
## diag_1_460.519                                  6.2207083  8.16626329
## diag_1_Other                                    3.1728715 10.57292323
## A1Cresult_.7                                    2.1875691  8.95404729
## A1Cresult_None                                  2.2313368  7.51412348
## diag_3_240.279.not250.                          2.1663534  9.26874224
## insulin_No                                      2.8134527  7.50820384
## diag_3_250.0.250.9                              3.8537883 10.57514157
## diag_3_460.519                                  6.2394695 10.18699950
## metformin_No                                    2.3301048  7.50052417
## diag_2_460.519                                  4.5554826  9.65229697
## admission_type_id_Emergency                     2.4751330  7.94639358
## gender                                         -1.6405112 10.65479000
## diag_2_240.279.not250.                          3.1160765 10.62532706
## diag_1_800.999                                  2.9348281  7.77344775
## diag_2_390.459                                  3.2708757  6.96198870
## insulin_Down                                    2.2209495  7.96595427
## diag_1_520.579                                 -0.7315720  8.61640853
## diag_3_Other                                    2.5013194  7.50148173
## glipizide_Steady                                2.7260345  6.93221814
## diag_3_390.459                                  4.3078870  9.60556662
## admission_source_id_ER                          1.0403159  6.89356636
## diag_2_580.629                                  3.7079860 13.07617459
## admission_type_id_Urgent                        0.9326043  9.51224145
## admission_source_id_TransferExtFacility         3.5172049  9.96513533
## pioglitazone_No                                 3.5826837  9.49094191
## insulin_Steady                                  4.0652623  8.16256922
## diag_3_580.629                                  2.7604690 10.73769356
## race_Caucasian                                 -1.3295215  8.83157824
## race_AfricanAmerican                           -2.6590198  9.42416972
## metformin_Steady                                1.5986706  5.82515618
## change                                          3.6782085  6.16403807
## admission_source_id_PhysRef                     3.0586153  5.64758056
## glipizide_No                                    2.8875435  8.54984364
## rosiglitazone_Steady                            0.4822536  8.36262094
## glyburide_Steady                               -2.9126503  8.89659338
## A1Cresult_Norm                                  1.9424454 10.51332225
## glyburide_No                                   -1.5763772  7.08590007
## diag_2_Other                                   -1.1201704  8.95944544
## pioglitazone_Steady                             2.3744028 10.45148702
## race_Other                                      0.5002715 13.69468841
## rosiglitazone_No                                0.4713427  9.93579835
## glimepiride_No                                  1.1822328 11.08603385
## race_Hispanic                                   0.7543741 12.11099998
## glimepiride_Steady                              0.2010008 10.62961100
## insulin_Up                                      2.0570592  6.57913459
## metformin_Up                                    3.4344588  9.31511671
## discharge_disposition_id_Other                  3.9151434  3.36050938
## glyburide_Up                                    0.9813711 10.09813597
## race_Asian                                     -2.3891635 11.58928858
## glipizide_Down                                  1.7192820 10.36994975
## metformin_Down                                 -2.1648139  9.51139297
## glyburide_Down                                  3.8722237 10.06373963
## glipizide_Up                                    3.5945204 10.02361035
## glimepiride_Up                                 -1.4319494  9.61702877
## glimepiride_Down                                2.0605294  9.02869980
## pioglitazone_Up                                 3.7708047  7.14507857
## rosiglitazone_Up                               -0.3480543  6.38180999
## pioglitazone_Down                               0.6420784  4.90001087
## rosiglitazone_Down                             -1.8529818  4.53099557
## admission_type_id_Other                         1.2893268  3.24593437
## admission_source_id_Other                      -1.0050378  1.35466574
##                                                MeanDecreaseAccuracy
## discharge_disposition_id_DcHome                          17.0970735
## number_inpatient                                         26.6896279
## discharge_disposition_id_DcOtherFacility                 16.4977084
## num_lab_procedures                                       19.2680777
## num_medications                                          19.4635692
## time_in_hospital                                         17.6750999
## age                                                      19.5011544
## number_diagnoses                                         16.5172479
## discharge_disposition_id_DcSNF                           12.8115110
## num_procedures                                           15.3313954
## number_emergency                                         13.3902766
## number_outpatient                                        12.4159590
## discharge_disposition_id_DcHomeWithHomeService           10.5441327
## diag_1_390.459                                           10.2584996
## diag_1_710.739                                           10.6050228
## diag_1_780.799                                           10.3330510
## diabetesMed                                              10.1866187
## diag_2_250.0.250.9                                       11.6872575
## admission_type_id_Elective                                9.0701260
## diag_1_250.0.250.9                                       11.3083006
## diag_1_460.519                                           10.1510994
## diag_1_Other                                             11.2072570
## A1Cresult_.7                                              9.3373276
## A1Cresult_None                                            8.3510701
## diag_3_240.279.not250.                                    9.1161665
## insulin_No                                                7.9692008
## diag_3_250.0.250.9                                       12.5406373
## diag_3_460.519                                            9.7877419
## metformin_No                                              9.6198653
## diag_2_460.519                                           11.5126188
## admission_type_id_Emergency                               9.1628492
## gender                                                   10.5794799
## diag_2_240.279.not250.                                   10.9900738
## diag_1_800.999                                            8.4326263
## diag_2_390.459                                            7.7775343
## insulin_Down                                              7.9739692
## diag_1_520.579                                            8.6497403
## diag_3_Other                                              8.1057916
## glipizide_Steady                                          8.6232793
## diag_3_390.459                                           10.5919700
## admission_source_id_ER                                    8.6262322
## diag_2_580.629                                           13.1316252
## admission_type_id_Urgent                                 11.6712347
## admission_source_id_TransferExtFacility                  11.0800950
## pioglitazone_No                                          10.4982311
## insulin_Steady                                            9.7549940
## diag_3_580.629                                           11.9447574
## race_Caucasian                                            8.2958087
## race_AfricanAmerican                                     10.6099292
## metformin_Steady                                          7.1026725
## change                                                    6.6344365
## admission_source_id_PhysRef                               7.1320642
## glipizide_No                                              6.8949536
## rosiglitazone_Steady                                      8.5779074
## glyburide_Steady                                          9.3583392
## A1Cresult_Norm                                           11.0033683
## glyburide_No                                              8.3883311
## diag_2_Other                                              8.6046700
## pioglitazone_Steady                                      10.1125507
## race_Other                                               13.1939725
## rosiglitazone_No                                         10.7775083
## glimepiride_No                                           11.6945507
## race_Hispanic                                            12.9251025
## glimepiride_Steady                                       12.6715893
## insulin_Up                                                6.6651456
## metformin_Up                                             10.6435103
## discharge_disposition_id_Other                            7.8760492
## glyburide_Up                                             10.6538180
## race_Asian                                                9.8034308
## glipizide_Down                                            9.8066336
## metformin_Down                                            9.5273982
## glyburide_Down                                           10.4265774
## glipizide_Up                                             10.7804539
## glimepiride_Up                                            8.5131870
## glimepiride_Down                                          8.3417774
## pioglitazone_Up                                           8.2031469
## rosiglitazone_Up                                          5.7213388
## pioglitazone_Down                                         4.4501799
## rosiglitazone_Down                                        3.4429731
## admission_type_id_Other                                   3.1454255
## admission_source_id_Other                                 0.8182474
##                                                MeanDecreaseGini
## discharge_disposition_id_DcHome                      662.582270
## number_inpatient                                     621.512426
## discharge_disposition_id_DcOtherFacility             444.016634
## num_lab_procedures                                   400.145651
## num_medications                                      384.631808
## time_in_hospital                                     350.991229
## age                                                  327.062205
## number_diagnoses                                     244.657062
## discharge_disposition_id_DcSNF                       201.916319
## num_procedures                                       148.578506
## number_emergency                                     125.905832
## number_outpatient                                    115.847108
## discharge_disposition_id_DcHomeWithHomeService        85.428634
## diag_1_390.459                                        77.100161
## diag_1_710.739                                        71.296273
## diag_1_780.799                                        66.688685
## diabetesMed                                           65.227630
## diag_2_250.0.250.9                                    50.822775
## admission_type_id_Elective                            48.545258
## diag_1_250.0.250.9                                    47.983343
## diag_1_460.519                                        47.316170
## diag_1_Other                                          47.284156
## A1Cresult_.7                                          46.992268
## A1Cresult_None                                        46.374683
## diag_3_240.279.not250.                                45.583893
## insulin_No                                            45.310808
## diag_3_250.0.250.9                                    44.329230
## diag_3_460.519                                        44.075098
## metformin_No                                          43.900171
## diag_2_460.519                                        43.603508
## admission_type_id_Emergency                           43.488502
## gender                                                43.346859
## diag_2_240.279.not250.                                41.878723
## diag_1_800.999                                        41.256242
## diag_2_390.459                                        40.726136
## insulin_Down                                          40.027178
## diag_1_520.579                                        39.584783
## diag_3_Other                                          39.551793
## glipizide_Steady                                      39.365237
## diag_3_390.459                                        39.189726
## admission_source_id_ER                                38.597732
## diag_2_580.629                                        38.439982
## admission_type_id_Urgent                              38.295915
## admission_source_id_TransferExtFacility               37.866302
## pioglitazone_No                                       37.530527
## insulin_Steady                                        37.448930
## diag_3_580.629                                        37.358298
## race_Caucasian                                        37.162406
## race_AfricanAmerican                                  36.254638
## metformin_Steady                                      35.259186
## change                                                35.181589
## admission_source_id_PhysRef                           34.548778
## glipizide_No                                          34.092538
## rosiglitazone_Steady                                  33.943044
## glyburide_Steady                                      33.912885
## A1Cresult_Norm                                        33.304157
## glyburide_No                                          33.138577
## diag_2_Other                                          32.853050
## pioglitazone_Steady                                   32.844591
## race_Other                                            31.910177
## rosiglitazone_No                                      31.330688
## glimepiride_No                                        31.173782
## race_Hispanic                                         30.895365
## glimepiride_Steady                                    29.959662
## insulin_Up                                            28.267457
## metformin_Up                                          23.259486
## discharge_disposition_id_Other                        20.543998
## glyburide_Up                                          20.139535
## race_Asian                                            19.380843
## glipizide_Down                                        19.333497
## metformin_Down                                        19.201035
## glyburide_Down                                        18.309285
## glipizide_Up                                          16.078325
## glimepiride_Up                                        13.040420
## glimepiride_Down                                       9.788466
## pioglitazone_Up                                        9.391275
## rosiglitazone_Up                                       5.762974
## pioglitazone_Down                                      4.478572
## rosiglitazone_Down                                     2.845325
## admission_type_id_Other                                1.340301
## admission_source_id_Other                              0.454570
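Since ggplot2 was loaded above for visualizations, the sorted importance table can also be summarized graphically. Below is a minimal sketch of plotting the top features by mean decrease in Gini; it is demonstrated on the built-in iris data so it runs standalone, since the readmission dataset is not bundled here. In practice, substitute the fitted forest behind rf.imp (the object names toy.rf, toy.imp, and imp.df are illustrative).

```r
library(randomForest)
library(ggplot2)

set.seed(42) # Make results reproducible
toy.rf <- randomForest(Species ~ ., data = iris, importance = TRUE) # Stand-in for the fitted forest
toy.imp <- importance(toy.rf) # Importance matrix; MeanDecreaseGini is the last column

imp.df <- data.frame(feature = rownames(toy.imp),
                     gini = toy.imp[, 'MeanDecreaseGini'])
imp.df <- imp.df[order(-imp.df$gini), ] # Descending, mirroring rf.imp[order(-rf.imp[,4]),]
imp.df <- imp.df[1:min(10, nrow(imp.df)), ] # Keep the top 10 (or fewer)

ggplot(imp.df, aes(x = reorder(feature, gini), y = gini)) +
  geom_col() +
  coord_flip() + # Horizontal bars keep long dummy-variable names readable
  labs(x = NULL, y = 'Mean decrease in Gini')
```

For a quick base-graphics alternative, randomForest also provides varImpPlot(toy.rf), which plots both MeanDecreaseAccuracy and MeanDecreaseGini side by side.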